Objective

We are interested in 1-day-ahead forecasting of surface solar radiation using probability models. For this report, we chose Schofield Barracks (SCBH1) as the target station.

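[Figure: map of the weather stations considered; stations with a solar radiation sensor are marked with a sun.]

An interactive version of the map is also available.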

As noted on the map, we chose SCBH1 because of its location at the center of the island. More precisely, SCBH1 is the station closest to the center with a sensor that samples solar radiation intensity (W/m^2), indicated by a sun on the figure. Fortunately, SCBH1 also has a suitably long data record (from 2002 up to the time of this report, 2015).

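We can also note from the map that 3 stations, to the west, south, and east, contain solar data. The stations to the north unfortunately contain only 2015 solar data and are therefore not considered further. Namely, the closest stations surrounding SCBH1 with solar data are WNVH1 (west), PLHH1 (south), and OFRH1 (east).

The weather variables available on them are solar radiation (SOLR), precipitation (PREC), relative humidity (RELH), dew point (DWPF), wind speed (SKNT), and temperature (TMPF).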

The remainder of this work investigates whether any noticeable improvement in forecasting performance can be obtained. In summary, the following parameters must be taken into account, since each may affect the performance of the model:

  1. Using 1 or more stations, parameterized by their distance.
  2. Using 1 or more weather variables from the same or different stations.
  3. Using 1 or more previous days (number of consecutive days) for each weather variable.
  4. Using the same or different discretization methods for each weather variable.
  5. Using 1 or more years of training data.
  6. Using varying seasonality.
  7. Using complete or partial observation data cleaning for a given weather variable (patterns of missing data).
  8. Using varying methods for the probability method (argmax, weighted sum, etc.).

We discuss next where, in the definition of the probability model, each of these questions is instantiated as a set of parameters.

Method

Formula: Parameters (1) to (3).

#Specify the model
formula <- "SCBH1.SOLR_Daym0|PLHH1.SOLR_Daym1,WNVH1.SOLR_Daym1,OFRH1.TMPF_Daym2"

The formula above uses our definition of features. It specifies a probability model whose feature of interest is the solar radiation on the current day at station SCBH1, given the solar radiation on the previous day at stations PLHH1 and WNVH1 and the temperature two days before at station OFRH1.
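To make the feature naming concrete, the following is a minimal sketch of how such a formula string could be split into its parts. The helper `parse_formula` is hypothetical and is not part of the code used in this report; it only assumes that each feature follows the pattern station, variable, and number of days back, as in the formula above.

```r
# Hypothetical helper (not the report's own code): split a formula string
# into the target feature and the features it is conditioned on.
parse_formula <- function(formula) {
  sides <- strsplit(formula, "|", fixed = TRUE)[[1]]
  stopifnot(length(sides) == 2)
  features <- c(sides[1], strsplit(sides[2], ",", fixed = TRUE)[[1]])
  # Each feature is assumed to look like "<station>.<variable>_Daym<lag>"
  pattern <- "^([A-Z0-9]+)\\.([A-Z]+)_Daym([0-9]+)$"
  stopifnot(all(grepl(pattern, features)))
  data.frame(
    role     = c("target", rep("given", length(features) - 1)),
    station  = sub(pattern, "\\1", features),
    variable = sub(pattern, "\\2", features),
    lag      = as.integer(sub(pattern, "\\3", features)),  # days before today
    stringsAsFactors = FALSE
  )
}

parsed <- parse_formula(
  "SCBH1.SOLR_Daym0|PLHH1.SOLR_Daym1,WNVH1.SOLR_Daym1,OFRH1.TMPF_Daym2")
print(parsed)
```

Each row of `parsed` corresponds to one feature, so parameters (1) to (3) can be read off the station, variable, and lag columns respectively.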

Specifically, the formula parameterizes questions (1) to (3). It is also important to note that conditioning on more than 3 consecutive days often leads to poor performance, because the number of possible combinations of discretized values grows quickly with each added day. We defer to a further report the investigation of the trade-off between the number of centroids, the time length, and the number of consecutive days.
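A rough back-of-the-envelope calculation illustrates the growth in combinations. The sketch below assumes 5 centroids per variable (the value used in this report's configuration) and roughly two years of daily training samples; both the variable names and the two-year figure are illustrative assumptions.

```r
ncentroids <- 5     # centroids per discretized variable
n_given    <- 1:5   # number of conditioning features (e.g. consecutive days)

# Number of joint states of the target plus its conditioning features
n_states <- ncentroids^(n_given + 1)

# Assumption: about two years of daily training samples
n_days <- 2 * 365

growth <- data.frame(n_given, n_states, days_per_state = n_days / n_states)
print(growth)
```

With 3 conditioning features there are already 625 joint states for about 730 training days, and with 4 there are more states than days, so most combinations are never observed in training.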

Discretization: Parameter (4).

We currently specify manually, for each weather variable, the most appropriate discretization method and the associated number of centroids. Currently, kmeans and average min-max binning are available. The following is the current configuration of parameters.

#Defines how to discretize each weather variable
model[["discretize"]] <- list()
#Precipitation
model[["discretize"]][["PREC"]][["method"]] <- "binning"
model[["discretize"]][["PREC"]][["ncentroids"]] <- 5
#Solar
model[["discretize"]][["SOLR"]][["method"]] <- "kmeans"
model[["discretize"]][["SOLR"]][["ncentroids"]] <- 5
#Humidity
model[["discretize"]][["RELH"]][["method"]] <- "kmeans"
model[["discretize"]][["RELH"]][["ncentroids"]] <- 5
#Dew Point
model[["discretize"]][["DWPF"]][["method"]] <- "kmeans"
model[["discretize"]][["DWPF"]][["ncentroids"]] <- 5
#Wind Speed
model[["discretize"]][["SKNT"]][["method"]] <- "kmeans"
model[["discretize"]][["SKNT"]][["ncentroids"]] <- 5
#Temperature
model[["discretize"]][["TMPF"]][["method"]] <- "kmeans"
model[["discretize"]][["TMPF"]][["ncentroids"]] <- 5
#station.train <- sapply(station,filter.year.interval,2012:2013)
#station.test <- sapply(station,filter.year.interval,2014)
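As an illustration of the two discretization methods, the sketch below discretizes a synthetic variable with base R. It is not the implementation used in this report: the function names are hypothetical, the data are random, and "binning" is assumed here to mean equal-width bins between the minimum and maximum with each bin represented by its midpoint.

```r
set.seed(1234)
# Synthetic daily values standing in for a weather variable (illustration only)
x <- runif(200, min = 0, max = 8000)

# k-means: centroids are learned from the data (stats::kmeans)
discretize_kmeans <- function(x, ncentroids) {
  fit <- kmeans(x, centers = ncentroids, nstart = 10)
  # Relabel clusters so that label 1 has the smallest centroid
  ord <- order(fit$centers[, 1])
  list(labels = match(fit$cluster, ord),
       centroids = unname(fit$centers[ord, 1]))
}

# Equal-width binning between min and max (assumed reading of "binning")
discretize_binning <- function(x, ncentroids) {
  breaks <- seq(min(x), max(x), length.out = ncentroids + 1)
  labels <- cut(x, breaks = breaks, labels = FALSE, include.lowest = TRUE)
  list(labels = labels,
       centroids = (head(breaks, -1) + tail(breaks, -1)) / 2)
}

km <- discretize_kmeans(x, 5)
bn <- discretize_binning(x, 5)
print(km$centroids)
print(bn$centroids)
```

In both cases each day is reduced to a label between 1 and the number of centroids, which is what the probability model counts over.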

Time Length: Parameter (5).

Lastly, we specify the time length of the training data and pass it as a parameter along with the loaded data, the formula, and the method of discretization. If a test length is specified, the model will also output the error of the forecast.

set.seed(1234)
pmodel <- joint_probability(
    data=model[["data"]],
    formula=model[["formula"]],
    discretize=model[["discretize"]],
    time.length=2012:2013,
    test.length=2014)

For the remainder of this work, the parameters are used as previously specified, aside from those defined by the formula, which will vary according to the questions we pose and answer in the following section.
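The internals of `joint_probability` are not shown in this report. As a rough sketch of the general idea only, the example below estimates a conditional probability table from a toy discretized series and produces a forecast with the two probability methods named in parameter (8). The data, the centroid values, and the `forecast` helper are all made up for illustration.

```r
# Toy discretized series: state (1..3) of solar radiation per day (made-up data)
states    <- c(1, 1, 2, 3, 3, 2, 1, 1, 2, 3, 3, 3, 2, 1, 1)
centroids <- c(1000, 3000, 5000)   # hypothetical centroid values

today     <- factor(states[-1], levels = 1:3)
yesterday <- factor(states[-length(states)], levels = 1:3)

# Joint counts, then P(today | yesterday): each row sums to 1
joint <- table(yesterday, today)
cond  <- prop.table(joint, margin = 1)

forecast <- function(prev_state, method = c("argmax", "weighted")) {
  method <- match.arg(method)
  p <- cond[prev_state, ]
  if (method == "argmax") {
    centroids[which.max(p)]   # centroid of the most probable state
  } else {
    sum(p * centroids)        # probability-weighted sum of the centroids
  }
}

print(cond)
print(forecast(1, "argmax"))
print(forecast(1, "weighted"))
```

The argmax method always returns one of the centroid values, whereas the weighted sum can return intermediate values, which is why the choice of method can affect the reported error.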

Questions and Results

As indicated in the objective, each of the following questions specifies a set of parameters in order to investigate whether any significant difference occurs in the model performance. Hence the following questions are prefixed by “Does the model performance improve...”

Question 1: For varying weather variables for the same station?

Station SCBH1

We define the following formulas to answer this question:

formulas <- c(
                "SCBH1.SOLR_Daym0|SCBH1.SOLR_Daym1",
                "SCBH1.SOLR_Daym0|SCBH1.PREC_Daym1",
                "SCBH1.SOLR_Daym0|SCBH1.RELH_Daym1",
                "SCBH1.SOLR_Daym0|SCBH1.DWPF_Daym1",
                "SCBH1.SOLR_Daym0|SCBH1.SKNT_Daym1",
                "SCBH1.SOLR_Daym0|SCBH1.TMPF_Daym1"
            )    

Daily average error (W/m^2) per conditioning variable:

    SOLR     PREC     RELH     DWPF     SKNT     TMPF
1750.395 2477.534 2315.263 2627.718 2633.518 1952.693

Station OFRH1 (East)

formulas <- c(
                "SCBH1.SOLR_Daym0|OFRH1.SOLR_Daym1",
                "SCBH1.SOLR_Daym0|OFRH1.PREC_Daym1",
                "SCBH1.SOLR_Daym0|OFRH1.RELH_Daym1",
                "SCBH1.SOLR_Daym0|OFRH1.DWPF_Daym1",
                "SCBH1.SOLR_Daym0|OFRH1.SKNT_Daym1",
                "SCBH1.SOLR_Daym0|OFRH1.TMPF_Daym1"
            )    

Daily average error (W/m^2) per conditioning variable:

    SOLR     PREC     RELH     DWPF     SKNT     TMPF
 2820.72 1962.721 2813.887 2721.729 1636.789 2396.521

Station PLHH1 (South)

formulas <- c(
                "SCBH1.SOLR_Daym0|PLHH1.SOLR_Daym1",
                "SCBH1.SOLR_Daym0|PLHH1.PREC_Daym1",
                "SCBH1.SOLR_Daym0|PLHH1.RELH_Daym1",
                "SCBH1.SOLR_Daym0|PLHH1.DWPF_Daym1",
                "SCBH1.SOLR_Daym0|PLHH1.SKNT_Daym1",
                "SCBH1.SOLR_Daym0|PLHH1.TMPF_Daym1"
            )    

Daily average error (W/m^2) per conditioning variable:

    SOLR     PREC     RELH     DWPF     SKNT     TMPF
2818.312 2360.254 2072.922  2844.57 2728.284 2918.546

Station WNVH1 (West)

formulas <- c(
                "SCBH1.SOLR_Daym0|WNVH1.SOLR_Daym1",
                "SCBH1.SOLR_Daym0|WNVH1.PREC_Daym1",
                "SCBH1.SOLR_Daym0|WNVH1.RELH_Daym1",
                "SCBH1.SOLR_Daym0|WNVH1.DWPF_Daym1",
                "SCBH1.SOLR_Daym0|WNVH1.SKNT_Daym1",
                "SCBH1.SOLR_Daym0|WNVH1.TMPF_Daym1"
            )    

Daily average error (W/m^2) per conditioning variable:

    SOLR     PREC     RELH     DWPF     SKNT     TMPF
1983.808 2825.611 2724.506 2403.816 2342.721 2598.258

Question 2: For varying skipping time lengths?

Here we vary the skipping time length, i.e. the pair of training years, (2012,2013), (2011,2012), etc., to understand whether the model could benefit from being online.

For 1 Consecutive Day.

formula <- "SCBH1.SOLR_Daym0|SCBH1.SOLR_Daym1"

time.lengths <- 2003:2013

Daily average error (W/m^2) per training pair:

2003,2004 2004,2005 2005,2006 2006,2007 2007,2008 2008,2009 2009,2010 2010,2011 2011,2012 2012,2013
 1689.538  1577.491  1570.457  1563.433  1627.389  1579.865  1628.269  1739.696  1657.253  1714.256
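The column labels above suggest that consecutive years are combined into sliding training pairs. A minimal sketch of how such pairs could be built from the vector of years follows; the names `pairs` and `pair_labels` are illustrative and not taken from the report's code.

```r
time.lengths <- 2003:2013

# Pair each year with the following one: (2003,2004), (2004,2005), ...
pairs <- cbind(start = head(time.lengths, -1),
               end   = tail(time.lengths, -1))
pair_labels <- paste(pairs[, "start"], pairs[, "end"], sep = ",")
print(pair_labels)
```

Eleven years yield ten overlapping pairs, one per column of the results.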

For 2 Consecutive Days.

formula <- "SCBH1.SOLR_Daym0|SCBH1.SOLR_Daym1,SCBH1.SOLR_Daym2"

time.lengths <- 2003:2013

Daily average error (W/m^2) per training pair:

2003,2004 2004,2005 2005,2006 2006,2007 2007,2008 2008,2009 2009,2010 2010,2011 2011,2012 2012,2013
 1708.717  2016.282  1714.291  1711.572  1613.302  1594.277   1648.36  2564.848  1786.643  1856.056

For 3 Consecutive Days.

formula <- "SCBH1.SOLR_Daym0|SCBH1.SOLR_Daym1,SCBH1.SOLR_Daym2,SCBH1.SOLR_Daym3"

time.lengths <- 2003:2013

Daily average error (W/m^2) per training pair:

2003,2004 2004,2005 2005,2006 2006,2007 2007,2008 2008,2009 2009,2010 2010,2011 2011,2012 2012,2013
 2049.321   2139.79  1887.671  1760.594  1733.617  1687.404  1711.279  2349.581  1846.352  1954.925

Results

Question 1

It is interesting to observe that, out of all combinations of the weather variables and stations listed, the best model was “SCBH1.SOLR_Daym0|OFRH1.SKNT_Daym1” (daily average error of 1637 W/m^2). It is followed by “SCBH1.SOLR_Daym0|SCBH1.SOLR_Daym1” (daily average error of 1750 W/m^2). SKNT stands for the wind speed variable. Furthermore, OFRH1 is the most distant of the 3 surrounding stations used for the model.
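The ranking can be reproduced from the four result tables of Question 1. The sketch below simply copies the reported errors into a matrix and sorts them; the object names are illustrative.

```r
# Errors as reported for Question 1 (rows: conditioning station)
errors <- rbind(
  SCBH1 = c(1750.395, 2477.534, 2315.263, 2627.718, 2633.518, 1952.693),
  OFRH1 = c(2820.720, 1962.721, 2813.887, 2721.729, 1636.789, 2396.521),
  PLHH1 = c(2818.312, 2360.254, 2072.922, 2844.570, 2728.284, 2918.546),
  WNVH1 = c(1983.808, 2825.611, 2724.506, 2403.816, 2342.721, 2598.258)
)
colnames(errors) <- c("SOLR", "PREC", "RELH", "DWPF", "SKNT", "TMPF")

# Flatten to a named vector "<station>.<variable>" and sort by error
labels <- as.vector(outer(rownames(errors), colnames(errors), paste, sep = "."))
ranked <- sort(setNames(as.vector(errors), labels))
print(head(ranked, 3))
```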

Question 2

If we vary the time length in pairs of years going back into the past, and consider the number of consecutive days from 1 (previous day) to 3 (previous days), then the best model performance occurs when the model is trained with the pair 2006,2007 (daily average error of 1563 W/m^2) using only 1 consecutive day. For 1 consecutive day, the most recent available data (2012,2013) results in an error of 1714 W/m^2.

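It is also interesting to note that the best training set is not consistent across different numbers of consecutive days: it is 2006,2007 for 1 consecutive day and 2008,2009 for 2 and 3 consecutive days, and it is never the most recent pair. The worst training set, in contrast, is 2010,2011 in all three cases. Hence, we are led to believe that a yearly online model, i.e. one that always has the latest year of data available, would not necessarily yield the best performance.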
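The comparison across training pairs can be checked directly against the three result tables of Question 2. The sketch below copies the reported errors and extracts the best and worst training pair for each number of consecutive days; the object names are illustrative.

```r
pairs <- paste(2003:2012, 2004:2013, sep = ",")

# Errors as reported for Question 2 (rows: number of consecutive days)
errors <- rbind(
  days1 = c(1689.538, 1577.491, 1570.457, 1563.433, 1627.389,
            1579.865, 1628.269, 1739.696, 1657.253, 1714.256),
  days2 = c(1708.717, 2016.282, 1714.291, 1711.572, 1613.302,
            1594.277, 1648.360, 2564.848, 1786.643, 1856.056),
  days3 = c(2049.321, 2139.790, 1887.671, 1760.594, 1733.617,
            1687.404, 1711.279, 2349.581, 1846.352, 1954.925)
)
colnames(errors) <- pairs

best  <- pairs[apply(errors, 1, which.min)]   # lowest error per row
worst <- pairs[apply(errors, 1, which.max)]   # highest error per row
print(data.frame(days = rownames(errors), best, worst))
```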

Overall

Overall, this experiment did not observe any significant error reduction; the lowest errors ranged between 1500 and 1700 units (W/m^2), which is similar to what has been reported in our published work.

Future Work

For the following questions, I intend to condense the 3 previous plots into a single scatterplot of errors. This will make the comparison easier, since each combination of station and weather variable will appear as just one new point in the plot. I will then add more combinations of observed variables, and create different scatterplots by varying the time length. Once these are put side by side, we should be able to assess parameters (1) to (5) easily and judge clearly whether the margin of 15% error can be surpassed.

For question 1, other combinations of models mixing weather variables need to be tested. For question 2, it seems pertinent to investigate whether an online model could forecast the next day over a smaller future time frame (i.e. instead of using the last full year of data to forecast the next full year, use only the most current day of data to forecast just the next day).