A large medical clinic would like to forecast daily patient visits for purposes of staffing.

1a. If data is only available for the last month, how does this affect the choice of model-based vs. data-driven methods?

With only one month of data, the series is too short for a data-driven method to learn patterns reliably, so a model-based method is the more practical choice. One month cannot reveal annual seasonality, although it does cover about four weekly cycles, so a model would need to impose structure drawn from domain knowledge or from data obtained elsewhere.

1b. The clinic has access to the admissions data of a nearby hospital. Under what conditions will including the hospital information be potentially useful for forecasting the clinic’s daily visits?

Several conditions need to hold: the hospital’s series must be long enough to reveal seasonality, the hospital must be comparable to the clinic in capacity and clientele, and the hospital must collect its data in the same way the clinic does. If these conditions are met, the hospital data could be valuable for forecasting fluctuations in patient counts despite the meager data from the clinic itself.

1c. Thus far, the clinic administrator takes a heuristic approach, using visit numbers from the same day of the previous week as a forecast. What is the advantage of this approach? What is the disadvantage?

Given how little data she has, using the same day of the previous week may be the easiest available way to plan for the number of patients coming through the door today: it is simple to compute and explain, and it captures the day-of-week pattern. However, it relies on a single observation with no context, so she doesn’t know objectively whether last week was a busy, average or slow time at the clinic, and any unusual day is carried straight into the forecast.
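The administrator’s heuristic is often called a seasonal naive forecast. The following is a minimal sketch of it, assuming daily counts stored in a plain list; the function name and the visit counts are hypothetical and only illustrate the idea.

```python
def seasonal_naive_forecast(visits, season_length=7):
    """Forecast the next day as the value observed one season (7 days) earlier."""
    if len(visits) < season_length:
        raise ValueError("need at least one full season of history")
    return visits[-season_length]


# Hypothetical daily visit counts for two weeks, Monday through Sunday.
visits = [120, 115, 110, 118, 125, 60, 40,
          130, 117, 112, 121, 128, 65, 42]

# The forecast for the coming Monday is simply last Monday's count.
print(seasonal_naive_forecast(visits))  # prints 130
```

The sketch makes the disadvantage visible: the forecast depends on a single past value, so one unusual Monday fully determines the next Monday’s forecast.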

1d. What level of automation appears to be required for this task? Explain.

Model-based methods are harder to automate than data-driven methods, but this seems to be a task that requires at least a moderate level of automation. A forecast must be produced every day to set staffing levels, and the clinic doesn’t seem to have much forecasting expertise. In this case, a model-based method would require regular checking to see whether it is still tracking observed patterns reasonably well.

1e. Describe two approaches for improving the current heuristic (naive) forecasting approach using ensembles.

1. Simple weighted averaging: Start from the clinic’s data for the past week and weight it according to fluctuations in the nearby hospital’s figures for the same time of year, going back as many years as possible. That would create a series of predicted values for each day that could be averaged to estimate visits day by day into the future. It’s slightly more advanced than the current heuristic method, but it’s still highly speculative because the clinic has no record of its own visits beyond the past month.

2. A more advanced multi-level analysis: This would use the weighted averages from the first approach as a starting point and compare the predicted values for past days with the values actually observed. The differences, or errors, could be modeled by a second model whose output is used to adjust the first-level forecasts and bring them closer to observed values. This approach is likely to outperform simple averaging and should generally improve with more time and data.
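The two ideas above can be sketched in a few lines. This is an illustration under simplifying assumptions, not the method described in the answer: it uses only the clinic’s own counts rather than hospital data, the second-level error model is just the mean of past errors, and all function names and numbers are hypothetical.

```python
from statistics import mean


def seasonal_naive(visits, season_length=7):
    """Same-day-last-week forecast for the next day."""
    return visits[-season_length]


def same_weekday_mean(visits, season_length=7):
    """Mean of all past observations on the same weekday as the next day."""
    return mean(visits[-season_length::-season_length])


def average_ensemble(forecasts, weights=None):
    """Approach 1: (weighted) average of several forecasts."""
    if weights is None:
        weights = [1.0] * len(forecasts)
    return sum(w * f for w, f in zip(weights, forecasts)) / sum(weights)


def error_adjusted_forecast(visits, season_length=7):
    """Approach 2: correct the naive forecast with a model of its past errors.

    Needs more than one season of history so that past errors exist.
    """
    # Level 1: errors the seasonal naive forecast made on past days.
    errors = [visits[t] - visits[t - season_length]
              for t in range(season_length, len(visits))]
    # Level 2: the simplest possible error model, the mean past error.
    return seasonal_naive(visits, season_length) + mean(errors)


# Hypothetical daily visit counts for two weeks, Monday through Sunday.
visits = [120, 115, 110, 118, 125, 60, 40,
          130, 117, 112, 121, 128, 65, 42]

combined = average_ensemble([seasonal_naive(visits), same_weekday_mean(visits)])
print(combined)                         # prints 127.5
print(error_adjusted_forecast(visits))  # about 133.86
```

In practice the weights and the error model would be chosen by checking forecasts against held-out days; the equal weights and mean-error correction here are only the simplest placeholders.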

2a. For each of the four types of methods for predicting wind power and speed (persistence, physical, statistical and hybrid), describe whether they’re model-based, data-driven or a combination.

Persistence is data-driven, since it uses the most recently observed wind speed as the forecast for future speeds.
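A persistence forecast can be written in a couple of lines, which shows why no model is involved. The sketch below assumes wind speeds in a plain list; the function name and the readings are hypothetical.

```python
def persistence_forecast(speeds, horizon=1):
    """Repeat the last observed speed for each of the next `horizon` steps."""
    if not speeds:
        raise ValueError("need at least one observation")
    return [speeds[-1]] * horizon


# Hypothetical wind speed readings in m/s.
speeds = [5.2, 6.1, 5.8]
print(persistence_forecast(speeds, horizon=3))  # prints [5.8, 5.8, 5.8]
```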

Physical is model-based, since it creates speed data based on current atmospheric conditions.

Statistical is data-driven. The giveaway here is that it isn’t based on any predefined model and relies on patterns in immediate past wind speeds. Patterns are best detected by data-driven methods.

Hybrid is a combination of the two, since it mixes physical and statistical methods.

2b. For each method, describe whether it’s based on extrapolation, causal modeling, correlation or a combination.

Persistence is based on extrapolation, since it projects the series’ own past speeds forward without using any other information.

Physical is based on causal modeling, since it assumes that other atmospheric conditions cause changes in wind speeds.

Statistical is based on extrapolation, since it uses past training data as a base for modeling wind speeds.

Hybrid is based on a combination of causal modeling and extrapolation, since it mixes physical and statistical methods.

2c. Describe the advantages and disadvantages of the hybrid approach.

A major advantage of the hybrid approach is that it can be more accurate and robust than either the physical or the statistical model alone, since each can offset the other’s weaknesses. However, it is harder to implement: it requires people who are experts in several methods, which increases costs.