We are interested in conducting 1-day-ahead surface solar forecasting using probability models. For this report, we chose Schofield Barracks (SCBH1) as the target station for forecasting surface solar radiation.
An interactive version of the map can be found by clicking here.
As noted on the map, we chose SCBH1 given its location at the center of the island. Furthermore, SCBH1 is the station closest to the center with a sensor that samples solar radiation intensity (W/m^2) (indicated by a sun on the figure). Luckily, SCBH1 also has a suitably long record of data (starting in 2002 and running up to the day of this report, in 2015).
The map also shows that 3 stations, to the west, south, and east, contain solar data. The northern stations unfortunately contain only 2015 solar data and are therefore not considered further. The closest stations surrounding SCBH1 with solar data are:
The weather variables available on them are as follows:
The remainder of this work examines whether any noticeable improvement in forecasting performance can be observed. In summary, the following parameters must also be taken into account, as they may affect the performance of the model:
We discuss next where, in the definition of the probability model, those questions are instantiated as a set of parameters.
#Specify the model
formula <- "SCBH1.SOLR_Daym0|PLHH1.SOLR_Daym1,WNVH1.SOLR_Daym1,OFRH1.TMPF_Daym2"
The formula above uses our definition of features. It specifies a probability model whose feature of interest is the solar radiation on the current day at station SCBH1 (SCBH1.SOLR_Daym0), given the solar radiation observed at stations PLHH1 and WNVH1 on the previous day (Daym1) and the temperature observed at OFRH1 two days before (Daym2).
Specifically, the formula parameterizes questions (1) to (3). It is also important to note that conditioning on more than 3 days often leads to poor performance, because the number of possible combinations of discretized values grows quickly. We defer to a further report the investigation of the trade-off between the number of centroids, the time length, and the number of consecutive days.
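The formula syntax can be read mechanically: the part before the bar is the forecast target, and the comma-separated parts after it are the observed features. The following is a minimal sketch, assuming every feature follows the STATION.VARIABLE_DaymN pattern seen in the formula above. The helpers parse_feature and parse_formula are hypothetical and are not part of the report's code. The last lines illustrate how quickly the number of combinations grows when each feature is discretized into 5 centroids.

```r
# Hypothetical helper: split one feature such as "SCBH1.SOLR_Daym1" into
# station ("SCBH1"), variable ("SOLR") and day lag (1).
parse_feature <- function(feature) {
  parts <- regmatches(
    feature,
    regexec("^([^.]+)\\.([^_]+)_Daym([0-9]+)$", feature)
  )[[1]]
  if (length(parts) != 4) stop("Unrecognized feature: ", feature)
  list(station = parts[2], variable = parts[3], lag = as.integer(parts[4]))
}

# Hypothetical helper: split "target|evidence1,evidence2,..." into its parts.
parse_formula <- function(formula) {
  sides <- strsplit(formula, "|", fixed = TRUE)[[1]]
  if (length(sides) != 2) stop("Expected exactly one '|' in the formula")
  evidence <- trimws(strsplit(sides[2], ",", fixed = TRUE)[[1]])
  list(
    target   = parse_feature(trimws(sides[1])),
    evidence = lapply(evidence, parse_feature)
  )
}

parsed <- parse_formula(
  "SCBH1.SOLR_Daym0|PLHH1.SOLR_Daym1,WNVH1.SOLR_Daym1,OFRH1.TMPF_Daym2"
)

# With 5 centroids per feature, a joint table over the target and its
# n evidence features has 5^(n + 1) cells: here 5^4 = 625.
ncentroids <- 5
n_cells <- ncentroids^(length(parsed$evidence) + 1)
```

Each additional conditioning feature multiplies the number of cells by the number of centroids, while the amount of training data stays the same, which is one plausible reason for the degradation noted above.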
We currently specify manually, for each weather variable, the most appropriate discretization method and the associated number of centroids. Currently, kmeans and average min-max binning are available. The following is the current configuration of parameters.
#Defines how to discretize each weather variable
model[["discretize"]] <- list()
#Precipitation
model[["discretize"]][["PREC"]][["method"]] <- "binning"
model[["discretize"]][["PREC"]][["ncentroids"]] <- 5
#Solar
model[["discretize"]][["SOLR"]][["method"]] <- "kmeans"
model[["discretize"]][["SOLR"]][["ncentroids"]] <- 5
#Humidity
model[["discretize"]][["RELH"]][["method"]] <- "kmeans"
model[["discretize"]][["RELH"]][["ncentroids"]] <- 5
#Dew Point
model[["discretize"]][["DWPF"]][["method"]] <- "kmeans"
model[["discretize"]][["DWPF"]][["ncentroids"]] <- 5
#Wind Speed
model[["discretize"]][["SKNT"]][["method"]] <- "kmeans"
model[["discretize"]][["SKNT"]][["ncentroids"]] <- 5
#Temperature
model[["discretize"]][["TMPF"]][["method"]] <- "kmeans"
model[["discretize"]][["TMPF"]][["ncentroids"]] <- 5
#station.train <- sapply(station,filter.year.interval,2012:2013)
#station.test <- sapply(station,filter.year.interval,2014)
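To make the two discretization methods concrete, the following sketch applies both to a synthetic vector. It is an illustration under stated assumptions, not the report's implementation: "binning" is interpreted here as equal-width bins between the minimum and maximum, each represented by its midpoint, and the helpers discretize_binning and discretize_kmeans are hypothetical names.

```r
# Assumed interpretation of "binning": split [min, max] into equal-width
# intervals and represent each interval by its midpoint.
discretize_binning <- function(x, ncentroids) {
  breaks <- seq(min(x), max(x), length.out = ncentroids + 1)
  bin <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  centroids <- (head(breaks, -1) + tail(breaks, -1)) / 2
  list(bin = bin, centroids = centroids)
}

# k-means: the data determine the centroids. Clusters are relabeled so that
# bin 1 always has the smallest centroid.
discretize_kmeans <- function(x, ncentroids) {
  fit <- kmeans(x, centers = ncentroids, nstart = 10)
  ord <- order(fit$centers[, 1])
  list(bin = match(fit$cluster, ord), centroids = fit$centers[ord, 1])
}

set.seed(1234)
solr <- runif(365, min = 0, max = 8000)  # synthetic values, not station data

by_bin <- discretize_binning(solr, 5)
by_km  <- discretize_kmeans(solr, 5)
```

Equal-width bins are cheap and reproducible, whereas k-means adapts the centroids to where the observations concentrate but depends on the random initialization, which is why a seed is set.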
Lastly, we specify the time length for the training data and pass it as a parameter along with the loaded data, the formula, and the discretization method. If a test length is specified, the model also outputs the forecast error.
set.seed(1234)
pmodel <- joint_probability(
  data = model[["data"]],
  formula = model[["formula"]],
  discretize = model[["discretize"]],
  time.length = 2012:2013,
  test.length = 2014)
For the remainder of this work, the parameters are used as previously specified, aside from those defined by the formula, which vary according to the questions posed and answered in the following section.
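The internals of joint_probability are not shown in this report. As a rough intuition only, the sketch below builds a conditional probability table from discretized values for the simplest case, today's solar bin given yesterday's solar bin, and forecasts the most probable bin. The data are synthetic, and forecast_bin is a hypothetical helper; none of this should be read as the actual implementation.

```r
set.seed(1234)
n <- 730

# Synthetic discretized solar bins (1..5) with some day-to-day persistence.
bins <- integer(n)
bins[1] <- 3L
for (t in 2:n) {
  stay <- runif(1) < 0.6
  bins[t] <- if (stay) bins[t - 1] else sample(1:5, 1)
}

today     <- factor(bins[-1], levels = 1:5)
yesterday <- factor(bins[-n], levels = 1:5)

# Joint counts, then P(today | yesterday) by normalizing each column.
joint <- table(today = today, yesterday = yesterday)
cond  <- prop.table(joint, margin = 2)

# Forecast the most probable bin for today given yesterday's bin.
forecast_bin <- function(yesterday_bin) {
  as.integer(which.max(cond[, as.character(yesterday_bin)]))
}

forecast_bin(3)
```

A forecast in W/m^2 could then be obtained by mapping the predicted bin back to its centroid, and the error by comparing that value with the observation for the test year.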
As indicated in the objective, each of the following questions specifies a set of parameters to investigate whether any significant difference occurs in the model performance. Hence the following questions are prefixed by “Does the model performance improve..”
We define the following formulas to answer this question:
formulas <- c(
"SCBH1.SOLR_Daym0|SCBH1.SOLR_Daym1",
"SCBH1.SOLR_Daym0|SCBH1.PREC_Daym1",
"SCBH1.SOLR_Daym0|SCBH1.RELH_Daym1",
"SCBH1.SOLR_Daym0|SCBH1.DWPF_Daym1",
"SCBH1.SOLR_Daym0|SCBH1.SKNT_Daym1",
"SCBH1.SOLR_Daym0|SCBH1.TMPF_Daym1"
)
| SOLR | PREC | RELH | DWPF | SKNT | TMPF |
|---|---|---|---|---|---|
| 1750.395 | 2477.534 | 2315.263 | 2627.718 | 2633.518 | 1952.693 |
formulas <- c(
"SCBH1.SOLR_Daym0|OFRH1.SOLR_Daym1",
"SCBH1.SOLR_Daym0|OFRH1.PREC_Daym1",
"SCBH1.SOLR_Daym0|OFRH1.RELH_Daym1",
"SCBH1.SOLR_Daym0|OFRH1.DWPF_Daym1",
"SCBH1.SOLR_Daym0|OFRH1.SKNT_Daym1",
"SCBH1.SOLR_Daym0|OFRH1.TMPF_Daym1"
)
| SOLR | PREC | RELH | DWPF | SKNT | TMPF |
|---|---|---|---|---|---|
| 2820.72 | 1962.721 | 2813.887 | 2721.729 | 1636.789 | 2396.521 |
formulas <- c(
"SCBH1.SOLR_Daym0|PLHH1.SOLR_Daym1",
"SCBH1.SOLR_Daym0|PLHH1.PREC_Daym1",
"SCBH1.SOLR_Daym0|PLHH1.RELH_Daym1",
"SCBH1.SOLR_Daym0|PLHH1.DWPF_Daym1",
"SCBH1.SOLR_Daym0|PLHH1.SKNT_Daym1",
"SCBH1.SOLR_Daym0|PLHH1.TMPF_Daym1"
)
| SOLR | PREC | RELH | DWPF | SKNT | TMPF |
|---|---|---|---|---|---|
| 2818.312 | 2360.254 | 2072.922 | 2844.57 | 2728.284 | 2918.546 |
formulas <- c(
"SCBH1.SOLR_Daym0|WNVH1.SOLR_Daym1",
"SCBH1.SOLR_Daym0|WNVH1.PREC_Daym1",
"SCBH1.SOLR_Daym0|WNVH1.RELH_Daym1",
"SCBH1.SOLR_Daym0|WNVH1.DWPF_Daym1",
"SCBH1.SOLR_Daym0|WNVH1.SKNT_Daym1",
"SCBH1.SOLR_Daym0|WNVH1.TMPF_Daym1"
)
| SOLR | PREC | RELH | DWPF | SKNT | TMPF |
|---|---|---|---|---|---|
| 1983.808 | 2825.611 | 2724.506 | 2403.816 | 2342.721 | 2598.258 |
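The four formula vectors above differ only in the station of the observed variable, so they can also be generated rather than typed out. A small sketch using only base R follows; the name formulas_by_station is our own choice for illustration.

```r
target    <- "SCBH1.SOLR_Daym0"
stations  <- c("SCBH1", "OFRH1", "PLHH1", "WNVH1")
variables <- c("SOLR", "PREC", "RELH", "DWPF", "SKNT", "TMPF")

# One character vector of six formulas per station, in the same order as the
# columns of the tables above.
formulas_by_station <- lapply(stations, function(station) {
  paste0(target, "|", station, ".", variables, "_Daym1")
})
names(formulas_by_station) <- stations

formulas_by_station[["OFRH1"]]
```

Generating the formulas this way keeps the column order of the result tables consistent across stations and makes it easy to add another station or lag later.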
Here we slide the training window in pairs of years, (2012,2013), (2011,2012), etc., to understand whether the model could benefit from being online.
formula <- "SCBH1.SOLR_Daym0|SCBH1.SOLR_Daym1"
time.lengths <- c(2003,
2004,
2005,
2006,
2007,
2008,
2009,
2010,
2011,
2012,
2013
)
| 2003,2004 | 2004,2005 | 2005,2006 | 2006,2007 | 2007,2008 | 2008,2009 | 2009,2010 | 2010,2011 | 2011,2012 | 2012,2013 |
|---|---|---|---|---|---|---|---|---|---|
| 1689.538 | 1577.491 | 1570.457 | 1563.433 | 1627.389 | 1579.865 | 1628.269 | 1739.696 | 1657.253 | 1714.256 |
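The ten column labels in the table above correspond to consecutive pairs drawn from the eleven years in time.lengths. A sketch of how such pairs might be built in base R is given below; year_pairs is a name introduced here for illustration, and how the test year is chosen for each pair is not stated in the text, so that step is left out.

```r
time.lengths <- 2003:2013

# Pair each year with the next one: (2003, 2004), (2004, 2005), ...
year_pairs <- Map(c, head(time.lengths, -1), tail(time.lengths, -1))
names(year_pairs) <- vapply(year_pairs, paste, character(1), collapse = ",")

names(year_pairs)
```

Each element of year_pairs could then be passed as the training time length in place of 2012:2013.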
formula <- "SCBH1.SOLR_Daym0|SCBH1.SOLR_Daym1,SCBH1.SOLR_Daym2"
time.lengths <- c(2003,
2004,
2005,
2006,
2007,
2008,
2009,
2010,
2011,
2012,
2013
)
| 2003,2004 | 2004,2005 | 2005,2006 | 2006,2007 | 2007,2008 | 2008,2009 | 2009,2010 | 2010,2011 | 2011,2012 | 2012,2013 |
|---|---|---|---|---|---|---|---|---|---|
| 1708.717 | 2016.282 | 1714.291 | 1711.572 | 1613.302 | 1594.277 | 1648.36 | 2564.848 | 1786.643 | 1856.056 |
formula <- "SCBH1.SOLR_Daym0|SCBH1.SOLR_Daym1,SCBH1.SOLR_Daym2,SCBH1.SOLR_Daym3"
time.lengths <- c(2003,
2004,
2005,
2006,
2007,
2008,
2009,
2010,
2011,
2012,
2013
)
| 2003,2004 | 2004,2005 | 2005,2006 | 2006,2007 | 2007,2008 | 2008,2009 | 2009,2010 | 2010,2011 | 2011,2012 | 2012,2013 |
|---|---|---|---|---|---|---|---|---|---|
| 2049.321 | 2139.79 | 1887.671 | 1760.594 | 1733.617 | 1687.404 | 1711.279 | 2349.581 | 1846.352 | 1954.925 |
Question 1
It is interesting to observe that the best model out of all the combinations of weather variables and stations listed was “SCBH1.SOLR_Daym0|OFRH1.SKNT_Daym1” (Daily Average Error of 1637 W/m^2). It is followed by “SCBH1.SOLR_Daym0|SCBH1.SOLR_Daym1” (Daily Average Error of 1750 W/m^2). SKNT stands for the wind speed variable. Furthermore, OFRH1 is the most distant of the 3 surrounding stations used for the model.
Question 2
If we vary the time length in pairs of years going back into the past, and consider from 1 to 3 consecutive previous days, then the best model performance occurs when the model is trained with the pair 2006,2007 (Daily Average Error of 1563 W/m^2) using only 1 consecutive day. For 1 consecutive day, the most recent available data (2012,2013) results in an error of 1714 W/m^2.
It is also interesting to note that the best training set is not consistent across different numbers of consecutive days: it is 2006,2007 for 1 day and 2008,2009 for 2 and 3 days, while the worst is 2010,2011 in all three cases. In no case is the most recent pair the best one. Hence, we are led to believe that a yearly online model, i.e. one that always has the latest year of data available, would not necessarily yield the best performance.
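The comparison across training pairs can be reproduced directly from the three tables of errors above. The sketch below copies those values into a matrix (rows are the number of consecutive days, columns are the training pairs) and looks up the best and worst pair per row; the object names are our own.

```r
pairs <- c("2003,2004", "2004,2005", "2005,2006", "2006,2007", "2007,2008",
           "2008,2009", "2009,2010", "2010,2011", "2011,2012", "2012,2013")

# Daily average errors copied from the three tables above.
errors <- rbind(
  days1 = c(1689.538, 1577.491, 1570.457, 1563.433, 1627.389,
            1579.865, 1628.269, 1739.696, 1657.253, 1714.256),
  days2 = c(1708.717, 2016.282, 1714.291, 1711.572, 1613.302,
            1594.277, 1648.360, 2564.848, 1786.643, 1856.056),
  days3 = c(2049.321, 2139.790, 1887.671, 1760.594, 1733.617,
            1687.404, 1711.279, 2349.581, 1846.352, 1954.925)
)
colnames(errors) <- pairs

best  <- pairs[apply(errors, 1, which.min)]
worst <- pairs[apply(errors, 1, which.max)]

best   # "2006,2007" "2008,2009" "2008,2009"
worst  # "2010,2011" "2010,2011" "2010,2011"
```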
Overall
Overall, these experiments did not show any significant error reduction: the lowest errors stayed between roughly 1500 and 1700 units, which is similar to what has been reported in our published work.
For the following questions, I intend to reduce the 3 previously displayed plots into a scatterplot of errors. This will make the comparison easier, since each combination of station and weather variable will simply add a new point to the plot. I will then add more combinations of observed variables and create different scatterplots by varying the time length. Once these are put side by side, we should be able to easily assess parameters (1) to (5) and judge clearly whether the 15% error margin can be surpassed.
For question 1, other combinations of models need to be tested, mixing weather variables. For question 2, it seems pertinent to investigate an online model that attempts to forecast the next day over a smaller future time frame (i.e., instead of using the last full year of data to forecast the next full year, use only the most current day of data to forecast just the next day).