Problem #1

A large medical clinic would like to forecast daily patient visits for purposes of staffing. a. If data is available only for the last month, how does this affect the choice of model-based vs. data-driven methods? Because we are forecasting visits for a specific day, one month of data gives only about four data points for any particular day of the week (Sunday-Saturday). With a series this short, model-based methods are generally preferable: once the underlying model is chosen, few data points are needed to estimate its parameters. Data-driven methods learn patterns from the data and usually require more data to learn them adequately. The exception is the naive forecast, which is also data-driven but simply uses the last data point in the series as the forecast, so four data points per day of the week would still be adequate for it.
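As a quick check of the "about four data points per day of the week" claim, the sketch below counts weekday occurrences in one month of daily data. The function name and the choice of January 2018 are assumptions made only for illustration; any month gives four or five occurrences per weekday.

```python
from collections import Counter
from datetime import date, timedelta

def weekday_counts(start, days):
    """Count how often each weekday occurs in a run of consecutive days."""
    names = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
    return Counter(
        names[(start + timedelta(days=i)).weekday()] for i in range(days)
    )

# Arbitrary example month (an assumption, not the clinic's actual data period).
counts = weekday_counts(date(2018, 1, 1), 31)
print(dict(counts))
```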

The clinic has access to the admissions data of a nearby hospital. Under what conditions will including the hospital information be potentially useful for forecasting the clinic’s daily visits? The hospital information would be useful if hospital admissions are related to clinic visits. We could use a causal econometric model, which assumes some type of causality between the two data sets. Alternatively, we could capture the associations or correlations between the two data sets and build them into the forecasting method. Finally, we could relate the external hospital data to the clinic data heuristically to forecast some measure of daily patient visits.
Thus far, the clinic administrator takes a heuristic approach, using the visit number from the same day of the previous week as a forecast. What is the advantage of this approach? The advantage of this approach is that it is simple, needs only one week of history, and captures the day-of-week pattern in visits. What is the disadvantage? The disadvantage is that each forecast rests on a single past observation, so it is sensitive to an unusual day and ignores any trend. It also ignores external information such as the hospital data, which could exploit a relationship between the two series that the clinic's own historical values do not capture, although the hospital data may not always be available when a prediction is to be made.
What level of automation appears to be required for this task? Explain. Not a lot. We are forecasting expected daily patient visits from only one month (about four weeks) of data, so there is little data to model, and the output is limited to the seven days of the week. Constant, periodic re-evaluation of the output is therefore not required.
Describe two approaches for improving the current heuristic (naive) forecasting approach using ensembles. The first ensemble approach would be to apply multiple forecasting methods to the clinic's time series and average their results to produce the final forecast. The second would be to model the two different series that measure the same phenomenon of interest (visits), the clinic data and the hospital data, and combine their forecasts. We would evaluate this ensemble by comparing the forecasts based on the month of clinic data against the forecasts based on the hospital data.
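The administrator's heuristic and the first ensemble approach can be sketched in a few lines. This is a minimal illustration, not a recommended model: the visit numbers are invented, the function names are hypothetical, and the same-weekday average is just one possible second method to combine with the heuristic.

```python
from statistics import mean

def seasonal_naive(history, period=7):
    """Forecast the next day as the value observed one week earlier."""
    if len(history) < period:
        raise ValueError("need at least one full week of history")
    return history[-period]

def weekday_mean(history, period=7):
    """Forecast the next day as the mean of all past values on that weekday."""
    same_day = history[-period::-period]  # step back one week at a time
    return mean(same_day)

def ensemble(history, methods):
    """Average the forecasts produced by several methods."""
    return mean(method(history) for method in methods)

# Hypothetical four weeks of daily visits, Monday through Sunday (invented).
visits = [
    120, 110, 105, 108, 115, 60, 40,
    126, 112, 101, 110, 119, 64, 44,
    118, 108, 107, 104, 113, 58, 38,
    124, 114, 103, 106, 117, 62, 42,
]

# The series ends on a Sunday, so the next day to forecast is a Monday.
print(seasonal_naive(visits))                          # heuristic forecast
print(weekday_mean(visits))                            # same-weekday average
print(ensemble(visits, [seasonal_naive, weekday_mean]))  # simple ensemble
```

Averaging with equal weights is the simplest combining rule; with more data the weights could instead be chosen from each method's past accuracy.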

Problem #2

The ability to scale up renewable energy, and in particular wind power and speed, is dependent on the ability to forecast its short-term availability. Soman et al. (2010) describe different methods for wind power forecasting:

Persistence Method, Physical Approach, Statistical Approach, Hybrid Approach

For each of the four types of methods, describe whether it is model-based, data-driven, or a combination. The persistence method is a naive predictor: it is data-driven, using a previous data point in the series to generate the forecast. The physical approach parameterizes the physical characteristics of the atmosphere and uses them to predict short-term power availability. It is more model-based, since these parameterizations hold throughout the data period and the model with its estimated parameters is used to generate forecasts. The statistical approach is more data-driven, in that local pattern changes are used to build the model. Since these local patterns do not extend throughout the time series, a model-based approach would be impractical because it would have to determine when the pattern changes. The hybrid approach is a combination, as it joins the model-based physical approach with the data-driven statistical approach.
For each of the four types of methods, describe whether it is based on extrapolation, causal modeling, correlation modeling or a combination. The persistence method, a naive predictor, relies only on its own history to forecast and is thus an extrapolation method. The physical approach uses parameterization to relate atmospheric conditions to wind power availability through causality or association, so it is causal modeling. The statistical approach uses patterns in the past error between actual and predicted wind speeds to tweak the model parameters. It is technically an extrapolation method, since the measurement data is used to predict wind speeds and the difference between prior actual and predicted wind speeds (the “past error” data) is then used to adjust the parameters. However, this error is its own time series, derived from the actual data and a model, so it would not be incorrect to call this a two-level combination method, in which the first level uses the original time series to forecast future values and the second level forecasts future forecast errors. That said, here we are not forecasting future forecast errors so much as adjusting the model parameters, and thus the model itself, to account for the causal or correlative effect of the error series as we move forward in time. If the error series is treated as external information that correlates with the target series, one could also argue that this is causal or correlation modeling, which would make the statistical approach a combination of extrapolative, causal and correlative modeling. The hybrid approach uses both the physical and statistical approaches and is therefore a combination of extrapolative, causal and correlative modeling, following the model types that make up those two approaches.
Describe the advantages and disadvantages of the hybrid approach. The hybrid approach combines methods, which can turn out to be simpler than trying to find the perfect single model, and it can take advantage of the ability of different forecasting methods to capture different aspects of the time series. However, combining methods generally increases costs, requires analysts who are familiar with several forecasting methods, and requires that a predetermined combining rule be agreed upon to avoid biases.
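The persistence method and the two-level idea discussed above can be sketched as follows. This is only an illustration under stated assumptions: the wind speeds and the "physical" forecasts are invented, the function names are hypothetical, and correcting by the mean of recent errors is a simple stand-in, not the specific statistical method described by Soman et al.

```python
from statistics import mean

def persistence(history):
    """Naive forecast: the next value equals the most recent observation."""
    return history[-1]

def error_corrected(model_forecast, past_actuals, past_forecasts, window=2):
    """Second level: shift a model forecast by the mean of its recent errors."""
    errors = [a - f for a, f in zip(past_actuals, past_forecasts)]
    return model_forecast + mean(errors[-window:])

# Hypothetical wind speeds in m/s and matching model forecasts (invented).
actual = [6.0, 7.0, 8.0, 7.0]
physical = [5.0, 6.5, 7.0, 6.5]  # stand-in for physical-model output
next_physical = 7.0              # assumed physical forecast for the next step

print(persistence(actual))
print(error_corrected(next_physical, actual, physical, window=2))
```

In this toy data the physical forecasts run consistently low, so the correction raises the next forecast; that is the sense in which a hybrid can capture something neither component captures alone.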

Submission of Project Ideas

Thomas Beretich, Sean McGowan and Andrew Deighan

Emergency Department Volume - Sean McGowan

First, what is the “problem or question” you are trying to solve or answer? You should be able to put this idea into a single sentence.

-The Maine Med Emergency Department has been experiencing high patient volumes, which are causing issues with staffing and beds.

Second, why is this problem both important and interesting? The problem should be based on a real struggle or “pain point” of a business or an organization and can be analyzed using time series analysis and forecasting.

Wait times in the emergency room are longer and care quality goes down, which negatively affects reimbursement to the hospital.

Third, because everything we do in the class relies on data, where and how did you obtain the necessary time series data to utilize the tools and models we covered in class?

We are able to get access to patient volume numbers since January 2013.

Fourth, has a similar problem been solved by others? Use your resources at the library to research previous peer-reviewed journal articles that are relevant to your project.

Yes, there have been other studies done to predict ED volume.

Wikipedia Forecasting Competition - Andrew Deighan

The goal of this competition is to create a model that accurately predicts future traffic on Wikipedia pages. The forecast scope includes 145,000 different Wikipedia articles and the forecast horizon is 2 months. For each of the 145,000 articles, there is a time series of the number of daily views of that article. The competition website provides this data from July 1st, 2015 through September 1st, 2017. The competition website does not specifically state why Wikipedia wants to be able to accurately predict traffic 2 months out. However, I would think it would be important for properly planning resource allocation. I’m not well versed in the technical aspects of how the internet works, but I’m under the impression that servers can only handle certain loads of traffic. The competition is over, but the data is still freely available and this could be a very interesting project.

M4 Time Series Modeling Competition - Thomas Beretich

-The purpose of the M4-Competition is to replicate the results of the previous 3 competitions whose purpose was to identify the most accurate forecasting method(s) for different types of predictions.

Why is this problem both important and interesting? The problem should be based on a real struggle or “pain point” of a business or an organization and can be analyzed using time series analysis and forecasting.

Probably the most interesting part of this problem is that it is a competition and participants will be comparing their models against many other competitors. That aside, the analysis of different approaches to modeling will be a very important outcome.

Because everything we do in the class relies on data, where and how did you obtain the necessary time series data to utilize the tools and models we covered in class?

The data is being furnished by the competition organizers.

Has a similar problem been solved by others? Use your resources at the library to research previous peer-reviewed journal articles that are relevant to your project.

Yes, this is the fourth competition by the same organizers and it replicates the organization of the competitions done previously.

Homework Assignment #3 MBA 678 - Professor Matthew Dean

Thomas Beretich

February 19, 2018