Objective
Method
Questions and Results
- Question 1: For varying weather variables for the same station?
- Question 2: For varying skipping time lengths?
Results
Future Work

Objective

We are interested in 1-day-ahead forecasting of surface solar radiation using probability models. For this report, we chose Schofield Barracks (SCBH1) as the target station.

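[Figure: map of the weather stations on the island; stations with a solar radiation sensor are marked with a sun.]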

An interactive version of the map is also available.

As noted on the map, we chose SCBH1 because of its location at the center of the island. Furthermore, SCBH1 is the station closest to the center with a sensor that samples solar radiation intensity (W/m^2), indicated by a sun on the figure. Luckily, SCBH1 also has a suitably long data record (from 2002 up to the time of this report, 2015).

We can also note from the map that 3 stations to the west, south, and east contain solar data. The stations to the north unfortunately contain only 2015 solar data and are therefore not considered further. The closest stations surrounding SCBH1 with solar data are:

OFRH1 (East)
PLHH1 (South)
WNVH1 (West)

The weather variables available on them are as follows:

Precipitation (PREC)
Solar (SOLR)
Humidity (RELH)
Dew Point (DWPF)
Wind Speed (SKNT)
Temperature (TMPF)

The remainder of this work investigates whether any noticeable improvement in forecasting performance can be obtained. In summary, the following parameters must be taken into account and may affect the performance of the model:

(1) Use 1 or more stations, parameterized by their distance.
(2) Use 1 or more weather variables from the same or different stations.
(3) Use 1 or more previous days (number of consecutive days) for each weather variable.
(4) Use the same or different discretization methods for each weather variable.
(5) Use 1 or more years of training data.
(6) Use varying seasonality.
(7) Use complete or partial observation data cleaning for a given weather variable (patterns of missing data).
(8) Use varying methods for the probability method (argmax, weighted sum, etc.).

Next, we discuss where in the definition of the probability model these questions are instantiated as parameters.

Method


Formula: Parameters (1) to (3).

#Specify the model
formula <- "SCBH1.SOLR_Daym0|PLHH1.SOLR_Daym1,WNVH1.SOLR_Daym1,OFRH1.TMPF_Daym2"

The formula above uses our definition of features. It specifies a probability model whose feature of interest is the solar radiation on the current day at station SCBH1, given the solar radiation at stations PLHH1 and WNVH1 on the previous day and the temperature at station OFRH1 two days before.
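To make the feature naming convention concrete, the sketch below splits a formula string into its target and conditioning features. The helper `parse_feature_formula` is hypothetical and not part of the report's code; it assumes every feature follows the pattern STATION.VARIABLE_DaymK, where K is the number of days before the forecast day.

```r
# Hypothetical helper (an assumption, not the report's implementation):
# split "TARGET|GIVEN1,GIVEN2,..." into one row per feature.
parse_feature_formula <- function(formula) {
  sides <- strsplit(formula, "|", fixed = TRUE)[[1]]
  stopifnot(length(sides) == 2)
  features <- c(sides[1], strsplit(sides[2], ",", fixed = TRUE)[[1]])

  # Capture station, weather variable, and day offset from each feature name.
  pattern <- "^([A-Z0-9]+)\\.([A-Z]+)_Daym([0-9]+)$"
  parts <- regmatches(features, regexec(pattern, features))

  data.frame(
    role     = c("target", rep("given", length(features) - 1)),
    station  = vapply(parts, `[`, character(1), 2),
    variable = vapply(parts, `[`, character(1), 3),
    lag      = as.integer(vapply(parts, `[`, character(1), 4)),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_feature_formula(
  "SCBH1.SOLR_Daym0|PLHH1.SOLR_Daym1,WNVH1.SOLR_Daym1,OFRH1.TMPF_Daym2"
)
print(parsed)
```

Each row of `parsed` names one feature, so parameters (1) to (3) can be read off as the `station`, `variable`, and `lag` columns.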

Specifically, the formula parameterizes questions (1) to (3). It is also important to note that using more than 3 consecutive days often leads to poor performance, because the number of possible combinations of discretized states grows quickly with each added feature. We defer to a further report the investigation of the trade-off between the number of centroids, the time length, and the number of consecutive days.
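A rough count illustrates why more consecutive days can hurt. The sketch below assumes 5 centroids per feature, about 730 training days (two years), and one conditioning feature per consecutive day; these numbers are illustrative assumptions, and the variable names are made up for this example.

```r
# Illustrative assumptions: 5 discrete states per feature, two years of data.
ncentroids <- 5
training.days <- 2 * 365
consecutive.days <- 1:4

# One target feature plus one conditioning feature per consecutive day.
cells <- ncentroids^(consecutive.days + 1)

sparsity <- data.frame(
  consecutive.days = consecutive.days,
  cells = cells,
  days.per.cell = training.days / cells
)
print(sparsity)
```

Under these assumptions, 3 consecutive days already give 625 joint states for 730 training days, so most combinations would be observed once or not at all.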

Discretization: Parameter (4).

We currently specify manually, for each weather variable, the most appropriate discretization method and the associated number of centroids. Currently, kmeans and average min-max binning are available. The following is the current configuration of parameters.

#Defines how to discretize each weather variable
model[["discretize"]] <- list()
#Precipitation
model[["discretize"]][["PREC"]][["method"]] <- "binning"
model[["discretize"]][["PREC"]][["ncentroids"]] <- 5
#Solar
model[["discretize"]][["SOLR"]][["method"]] <- "kmeans"
model[["discretize"]][["SOLR"]][["ncentroids"]] <- 5
#Humidity
model[["discretize"]][["RELH"]][["method"]] <- "kmeans"
model[["discretize"]][["RELH"]][["ncentroids"]] <- 5
#Dew Point
model[["discretize"]][["DWPF"]][["method"]] <- "kmeans"
model[["discretize"]][["DWPF"]][["ncentroids"]] <- 5
#Wind Speed
model[["discretize"]][["SKNT"]][["method"]] <- "kmeans"
model[["discretize"]][["SKNT"]][["ncentroids"]] <- 5
#Temperature
model[["discretize"]][["TMPF"]][["method"]] <- "kmeans"
model[["discretize"]][["TMPF"]][["ncentroids"]] <- 5
#station.train <- sapply(station,filter.year.interval,2012:2013)
#station.test <- sapply(station,filter.year.interval,2014)
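As a minimal sketch of what the two discretization methods could look like, the example below uses base R's `kmeans` and `cut` on made-up daily values. The exact definition of "average min-max binning" is not given in this report, so the equal-width bins with midpoint centroids shown here are an assumption, and none of the variable names come from the report's code.

```r
set.seed(1234)

# Made-up daily values standing in for one weather variable.
x <- c(120, 340, 560, 800, 1500, 2300, 3100, 4200, 5100, 6000)
k <- 3

# k-means: centroids move to where the observations concentrate.
km <- kmeans(x, centers = k, nstart = 10)
kmeans.state <- km$cluster
kmeans.centroids <- sort(km$centers[, 1])

# Equal-width binning between the minimum and maximum (assumed interpretation):
# the centroid of each bin is the average of its two edges.
edges <- seq(min(x), max(x), length.out = k + 1)
bin.state <- cut(x, breaks = edges, include.lowest = TRUE, labels = FALSE)
bin.centroids <- (edges[-length(edges)] + edges[-1]) / 2

print(kmeans.centroids)
print(bin.centroids)
```

Binning fixes the states from the range alone, which may suit a skewed variable such as precipitation, while k-means adapts the states to the observed distribution.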

Time Length: Parameter (5).

Lastly, we specify the time length of the training data and pass it as a parameter along with the loaded data, the formula, and the method of discretization. If a test length is specified, the model will also output the error of the forecast.

set.seed(1234)
pmodel <- joint_probability(
    data=model[["data"]],
    formula=model[["formula"]],
    discretize=model[["discretize"]],
    time.length=2012:2013,
    test.length=2014)

For the remainder of this work, the parameters are used as previously specified, aside from those defined by the formula, which vary according to the questions we pose and answer in the following section.
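The internals of `joint_probability` are not shown in this report. As a hedged illustration of the general idea only, the sketch below estimates a conditional probability table from made-up discretized states and turns it into a forecast by either argmax or weighted sum. The data, the centroid values, and the `forecast` helper are all assumptions for this example.

```r
# Made-up discretized states (1 to 3) for consecutive days.
yesterday <- c(1, 1, 2, 2, 2, 3, 3, 1, 2, 3)
today     <- c(1, 2, 2, 2, 3, 3, 2, 1, 2, 3)

# Hypothetical centroid value (W/m^2) represented by each state.
centroids <- c(1000, 3000, 5000)

# Joint counts, then normalize each row to obtain P(today | yesterday).
counts <- table(yesterday, today)
cond <- prop.table(counts, margin = 1)

# Forecast today's value from yesterday's state.
forecast <- function(state, method = c("argmax", "weighted")) {
  method <- match.arg(method)
  p <- cond[as.character(state), ]
  if (method == "argmax") {
    centroids[which.max(p)]   # centroid of the most probable state
  } else {
    sum(p * centroids)        # probability-weighted average of centroids
  }
}

print(forecast(2, "argmax"))
print(forecast(2, "weighted"))
```

The argmax variant always returns one of the centroid values, whereas the weighted sum can fall between centroids.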

Questions and Results

As indicated in the objective, each of the following questions specifies a set of parameters to investigate whether any significant difference occurs in the model performance. Hence, the following questions are prefixed by “Does the model performance improve...”

Question 1: For varying weather variables for the same station?

Station SCBH1

We define the following formulas to answer this question:

formulas <- c(
                "SCBH1.SOLR_Daym0|SCBH1.SOLR_Daym1",
                "SCBH1.SOLR_Daym0|SCBH1.PREC_Daym1",
                "SCBH1.SOLR_Daym0|SCBH1.RELH_Daym1",
                "SCBH1.SOLR_Daym0|SCBH1.DWPF_Daym1",
                "SCBH1.SOLR_Daym0|SCBH1.SKNT_Daym1",
                "SCBH1.SOLR_Daym0|SCBH1.TMPF_Daym1"
            )

SOLR	PREC	RELH	DWPF	SKNT	TMPF
1750.395	2477.534	2315.263	2627.718	2633.518	1952.693

Station OFRH1 (East)

formulas <- c(
                "SCBH1.SOLR_Daym0|OFRH1.SOLR_Daym1",
                "SCBH1.SOLR_Daym0|OFRH1.PREC_Daym1",
                "SCBH1.SOLR_Daym0|OFRH1.RELH_Daym1",
                "SCBH1.SOLR_Daym0|OFRH1.DWPF_Daym1",
                "SCBH1.SOLR_Daym0|OFRH1.SKNT_Daym1",
                "SCBH1.SOLR_Daym0|OFRH1.TMPF_Daym1"
            )

SOLR	PREC	RELH	DWPF	SKNT	TMPF
2820.72	1962.721	2813.887	2721.729	1636.789	2396.521

Station PLHH1 (South)

formulas <- c(
                "SCBH1.SOLR_Daym0|PLHH1.SOLR_Daym1",
                "SCBH1.SOLR_Daym0|PLHH1.PREC_Daym1",
                "SCBH1.SOLR_Daym0|PLHH1.RELH_Daym1",
                "SCBH1.SOLR_Daym0|PLHH1.DWPF_Daym1",
                "SCBH1.SOLR_Daym0|PLHH1.SKNT_Daym1",
                "SCBH1.SOLR_Daym0|PLHH1.TMPF_Daym1"
            )

SOLR	PREC	RELH	DWPF	SKNT	TMPF
2818.312	2360.254	2072.922	2844.57	2728.284	2918.546

Station WNVH1 (West)

formulas <- c(
                "SCBH1.SOLR_Daym0|WNVH1.SOLR_Daym1",
                "SCBH1.SOLR_Daym0|WNVH1.PREC_Daym1",
                "SCBH1.SOLR_Daym0|WNVH1.RELH_Daym1",
                "SCBH1.SOLR_Daym0|WNVH1.DWPF_Daym1",
                "SCBH1.SOLR_Daym0|WNVH1.SKNT_Daym1",
                "SCBH1.SOLR_Daym0|WNVH1.TMPF_Daym1"
            )

SOLR	PREC	RELH	DWPF	SKNT	TMPF
1983.808	2825.611	2724.506	2403.816	2342.721	2598.258
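The four formula vectors above follow a single pattern, so they could also be generated rather than typed out. The sketch below is one possible way to do so with `expand.grid` and `paste0`; the object name `grid` is made up for this example.

```r
stations  <- c("SCBH1", "OFRH1", "PLHH1", "WNVH1")
variables <- c("SOLR", "PREC", "RELH", "DWPF", "SKNT", "TMPF")

# One row per (station, variable) pair; the variable changes fastest.
grid <- expand.grid(variable = variables, station = stations,
                    stringsAsFactors = FALSE)

# Target is always today's solar radiation at SCBH1,
# conditioned on one variable from one station on the previous day.
grid$formula <- paste0("SCBH1.SOLR_Daym0|",
                       grid$station, ".", grid$variable, "_Daym1")

print(head(grid$formula))
```

This yields the 24 formulas of Question 1 in the same order as the listings above.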

Question 2: For varying skipping time lengths?

Here we vary the pair of training years, skipping back in time ((2012,2013), (2011,2012), etc.), to understand whether the model could benefit from being online.
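As a small sketch of how the pairs of training years can be derived from a vector of years, the example below builds every pair of consecutive years between 2003 and 2013. The names `windows` and `labels` are made up for this example.

```r
years <- 2003:2013

# Pair each year with the one that follows it.
windows <- cbind(first = years[-length(years)], second = years[-1])

# Labels in the same "first,second" form used to report the results.
labels <- paste(windows[, "first"], windows[, "second"], sep = ",")
print(labels)
```

Eleven years give ten overlapping pairs, from 2003,2004 to 2012,2013.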

For 1 Consecutive Day.

formula <- "SCBH1.SOLR_Daym0|SCBH1.SOLR_Daym1"

time.lengths <- c(2003,
                  2004,
                  2005,
                  2006,
                  2007,
                  2008,
                  2009,
                  2010,
                  2011,
                  2012,
                  2013
    )

2003,2004	2004,2005	2005,2006	2006,2007	2007,2008	2008,2009	2009,2010	2010,2011	2011,2012	2012,2013
1689.538	1577.491	1570.457	1563.433	1627.389	1579.865	1628.269	1739.696	1657.253	1714.256

For 2 Consecutive Days.

formula <- "SCBH1.SOLR_Daym0|SCBH1.SOLR_Daym1,SCBH1.SOLR_Daym2"

time.lengths <- c(2003,
                  2004,
                  2005,
                  2006,
                  2007,
                  2008,
                  2009,
                  2010,
                  2011,
                  2012,
                  2013
    )

2003,2004	2004,2005	2005,2006	2006,2007	2007,2008	2008,2009	2009,2010	2010,2011	2011,2012	2012,2013
1708.717	2016.282	1714.291	1711.572	1613.302	1594.277	1648.36	2564.848	1786.643	1856.056

For 3 Consecutive Days.

formula <- "SCBH1.SOLR_Daym0|SCBH1.SOLR_Daym1,SCBH1.SOLR_Daym2,SCBH1.SOLR_Daym3"

time.lengths <- c(2003,
                  2004,
                  2005,
                  2006,
                  2007,
                  2008,
                  2009,
                  2010,
                  2011,
                  2012,
                  2013
    )

2003,2004	2004,2005	2005,2006	2006,2007	2007,2008	2008,2009	2009,2010	2010,2011	2011,2012	2012,2013
2049.321	2139.79	1887.671	1760.594	1733.617	1687.404	1711.279	2349.581	1846.352	1954.925

Results

Question 1

It is interesting to observe that the best model out of all listed combinations of weather variables and stations was “SCBH1.SOLR_Daym0|OFRH1.SKNT_Daym1” (Daily Average Error of about 1637 W/m^2). This is followed by “SCBH1.SOLR_Daym0|SCBH1.SOLR_Daym1” (Daily Average Error of about 1750 W/m^2). SKNT stands for the wind speed variable. Furthermore, OFRH1 is the most distant of the 3 stations used for the model.
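The ranking can be reproduced from the errors reported under Question 1. The sketch below collects them in a matrix and sorts all station and variable combinations; only the object names are made up.

```r
# Daily average errors (W/m^2) as reported under Question 1.
errors <- rbind(
  SCBH1 = c(1750.395, 2477.534, 2315.263, 2627.718, 2633.518, 1952.693),
  OFRH1 = c(2820.72,  1962.721, 2813.887, 2721.729, 1636.789, 2396.521),
  PLHH1 = c(2818.312, 2360.254, 2072.922, 2844.57,  2728.284, 2918.546),
  WNVH1 = c(1983.808, 2825.611, 2724.506, 2403.816, 2342.721, 2598.258)
)
colnames(errors) <- c("SOLR", "PREC", "RELH", "DWPF", "SKNT", "TMPF")

# Label every cell as STATION.VARIABLE (column-major, matching as.vector).
cell.names <- as.vector(outer(rownames(errors), colnames(errors),
                              paste, sep = "."))
ranked <- sort(setNames(as.vector(errors), cell.names))

print(head(ranked, 3))
```

The three lowest errors are OFRH1.SKNT (1636.789), SCBH1.SOLR (1750.395), and SCBH1.TMPF (1952.693).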

Question 2

If we vary the training data in pairs of years going back in time, and consider from 1 to 3 consecutive previous days, then the best model performance occurs when the model is trained with the pair 2006,2007 (Daily Average Error of about 1563 W/m^2) using only 1 consecutive day. For 1 consecutive day, the most recent available pair (2012,2013) results in an error of about 1714 W/m^2.

It is also interesting to note that the best training pair is not consistent across different numbers of consecutive days (2006,2007 for 1 day, 2008,2009 for 2 and 3 days) and is never the most recent pair, while the worst pair is 2010,2011 in all three cases. Hence, we are led to believe that a yearly online model, i.e. one that always has the latest year of data available, would not necessarily yield the best performance.
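The best and worst training pairs can be read directly from the errors reported under Question 2. The sketch below tabulates them by number of consecutive days; only the object names are made up.

```r
pairs <- c("2003,2004", "2004,2005", "2005,2006", "2006,2007", "2007,2008",
           "2008,2009", "2009,2010", "2010,2011", "2011,2012", "2012,2013")

# Daily average errors (W/m^2) as reported under Question 2,
# one row per number of consecutive days.
errors <- rbind(
  "1" = c(1689.538, 1577.491, 1570.457, 1563.433, 1627.389,
          1579.865, 1628.269, 1739.696, 1657.253, 1714.256),
  "2" = c(1708.717, 2016.282, 1714.291, 1711.572, 1613.302,
          1594.277, 1648.36,  2564.848, 1786.643, 1856.056),
  "3" = c(2049.321, 2139.79,  1887.671, 1760.594, 1733.617,
          1687.404, 1711.279, 2349.581, 1846.352, 1954.925)
)
colnames(errors) <- pairs

# Training pair with the lowest and highest error for each row.
best  <- pairs[apply(errors, 1, which.min)]
worst <- pairs[apply(errors, 1, which.max)]

summary.table <- data.frame(consecutive.days = rownames(errors),
                            best = best, worst = worst)
print(summary.table)
```

In these results the lowest error comes from 2006,2007 for 1 consecutive day and from 2008,2009 for 2 and 3 consecutive days, and the highest error comes from 2010,2011 in every row.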

Overall

Overall, these experiments did not show any significant error reduction: the lowest errors remained between roughly 1500 and 1700 W/m^2, which is similar to what has been reported in our published work.

Future Work

For the following questions, I intend to reduce the 3 previously displayed plots into a single scatterplot of errors. This will make the comparison easier, since each combination of station and weather variable will result in just one new point in the plot. I will then add more combinations of observed variables and create different scatterplots by varying the time length. Put side by side, these should make it easy to assess parameters (1) to (5) and to judge clearly whether the margin of 15% error can be surpassed.

For question 1, other combinations of models that mix weather variables need to be tested. For question 2, it seems pertinent to investigate an online model that attempts to forecast the next day for a smaller future time frame (i.e. instead of having the last full year of data available to forecast the next full year, attempt to use just the most current day of data to forecast just the next day).

Feature Probability Model

Carlos V. A. Silva

September 7, 2015