The purpose of this report is to investigate how precipitation (and, later, other weather variables) can improve solar forecasting with our argmax probability model. The data in this report come from Schofield Barracks (SCBH1).
At SCBH1, solar radiation is sampled hourly, and the records range from 2003 to 2014.
| MON | DAY | YEAR | HR | MIN | TMZN | TMPF | RELH | SKNT | GUST | DRCT | QFLG | SOLR | TLKE | PREC | SINT | FT | FM | PEAK | HI24 | LO24 | PDIR | VOLT | DWPF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 2003 | 0 | 55 | HST | 57 | 92 | 5 | 5 | 240 | 0 | 0 | NA | 18.94 | NA | 57 | 20 | NA | NA | NA | NA | NA | 54.7 |
| 1 | 1 | 2003 | 1 | 55 | HST | 59 | 87 | 3 | 6 | 240 | 0 | 0 | NA | 18.94 | NA | 55 | 21 | NA | NA | NA | NA | NA | 55.1 |
| 1 | 1 | 2003 | 2 | 55 | HST | 58 | 92 | 3 | 3 | 230 | 0 | 0 | NA | 18.94 | NA | 55 | 22 | NA | NA | NA | NA | NA | 55.7 |
| 1 | 1 | 2003 | 3 | 55 | HST | 57 | 92 | 3 | 5 | 250 | 0 | 0 | NA | 18.94 | NA | 56 | 23 | NA | NA | NA | NA | NA | 54.7 |
| 1 | 1 | 2003 | 4 | 55 | HST | 58 | 92 | 2 | 4 | 230 | 0 | 0 | NA | 18.94 | NA | 56 | 24 | NA | NA | NA | NA | NA | 55.7 |
| 1 | 1 | 2003 | 5 | 55 | HST | 57 | 94 | 3 | 5 | 240 | 0 | 0 | NA | 18.94 | NA | 54 | 26 | NA | NA | NA | NA | NA | 55.3 |
Currently, we remove daily vectors that contain any missing value between hours 8 and 17. Furthermore, the probability model discards any window that contains a missing day.
We vectorize each weather variable of interest (currently solar radiation and precipitation) into daily numeric vectors. Each row of the resulting table is one day, and its 24 columns hold the value of the variable at hours 0, 1, …, 23. For instance, the first few rows for solar radiation (only the columns for hours 8 to 17 are shown) are:
| YEAR | MON | DAY | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2003 | 1 | 1 | 322 | 507 | 746 | 790 | 770 | 874 | 645 | 402 | 94 | 7 |
| 2003 | 1 | 2 | 131 | 236 | 197 | 225 | 213 | 179 | 145 | 57 | 27 | 5 |
| 2003 | 1 | 3 | 314 | 166 | 288 | 470 | 323 | 194 | 161 | 423 | 46 | 7 |
| 2003 | 1 | 4 | 64 | 130 | 808 | 793 | 807 | 752 | 150 | 68 | 20 | 3 |
| 2003 | 1 | 5 | 53 | 644 | 691 | 821 | 244 | 269 | 236 | 404 | 54 | 10 |
| 2003 | 1 | 6 | 226 | 730 | 970 | 500 | 785 | 402 | 632 | 403 | 199 | 10 |
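The vectorization step can be sketched in base R as follows. This is only an illustration under assumptions: the toy values are made up, the names `hourly` and `daily` are hypothetical, and the report's actual code may use a different reshaping approach.

```r
# Toy hourly records in the long layout of the raw table (values are made up).
hourly <- data.frame(
  YEAR = 2003, MON = 1,
  DAY  = rep(1:2, each = 24),
  HR   = rep(0:23, times = 2),
  SOLR = c(rep(0, 8), seq(100, 1000, by = 100), rep(0, 6),
           rep(0, 8), seq(50, 500, by = 50), rep(0, 6))
)

# One row per day, one column per hour (0..23).
daily <- reshape(
  hourly[, c("YEAR", "MON", "DAY", "HR", "SOLR")],
  idvar     = c("YEAR", "MON", "DAY"),
  timevar   = "HR",
  direction = "wide"
)

# reshape() names the hour columns "SOLR.0", "SOLR.1", ...; keep only the hour.
names(daily) <- sub("^SOLR\\.", "", names(daily))
```

An hour that is absent from the long table ends up as `NA` in the wide table, which is what the later missing-value filter relies on.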
Since the early and late hours of the day contain little to no solar radiation, we keep only hours 8 to 17 for all weather variables of interest. For instance, for solar radiation:
| YEAR | MON | DAY | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2003 | 1 | 1 | 322 | 507 | 746 | 790 | 770 | 874 | 645 | 402 | 94 | 7 |
| 2003 | 1 | 2 | 131 | 236 | 197 | 225 | 213 | 179 | 145 | 57 | 27 | 5 |
| 2003 | 1 | 3 | 314 | 166 | 288 | 470 | 323 | 194 | 161 | 423 | 46 | 7 |
| 2003 | 1 | 4 | 64 | 130 | 808 | 793 | 807 | 752 | 150 | 68 | 20 | 3 |
| 2003 | 1 | 5 | 53 | 644 | 691 | 821 | 244 | 269 | 236 | 404 | 54 | 10 |
| 2003 | 1 | 6 | 226 | 730 | 970 | 500 | 785 | 402 | 632 | 403 | 199 | 10 |
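A minimal sketch of the hour selection, combined with the removal of days that have a missing value between hours 8 and 17, is given below. The data and the names `daily`, `daytime_cols` and `daytime` are hypothetical and only serve the illustration.

```r
# Toy daily table with hour columns "0".."23" (values are made up).
profile <- c(0, 0, 0, 0, 0, 0, 5, 60, 300, 500, 700, 800,
             780, 850, 650, 400, 90, 10, 0, 0, 0, 0, 0, 0)
solar <- matrix(rep(profile, 4), nrow = 4, byrow = TRUE)
solar[2, 11] <- NA  # hour 10 missing on day 2: inside 8..17, day is dropped
solar[3, 3]  <- NA  # hour 2 missing on day 3: outside 8..17, day is kept

daily <- data.frame(YEAR = 2003, MON = 1, DAY = 1:4, solar)
names(daily) <- c("YEAR", "MON", "DAY", as.character(0:23))

# Keep only the daytime hours, then drop days with any missing daytime value.
daytime_cols <- as.character(8:17)
daytime <- daily[, c("YEAR", "MON", "DAY", daytime_cols)]
daytime <- daytime[complete.cases(daytime[, daytime_cols]), ]
```

Selecting the columns before calling `complete.cases()` matters: a gap at night should not cost a day that is complete during the hours actually used.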
In order to calculate the probability distributions, we need to first discretize each weather variable. Depending on the weather variable, the discretization method is different.
For the solar radiation vectors, we apply 5-means clustering. We chose 5 clusters because this gave a clear separation between the daily solar radiation profiles over the years covered by the dataset.
The ids of the centroids (1 to 5) are ordered by the total solar intensity of each centroid (5 is highest, 1 is lowest). The centroids, listed from id 1 to id 5, are:
| 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|---|---|---|---|---|---|---|---|---|---|
| 96.45584 | 145.7208 | 184.0028 | 208.9416 | 225.6738 | 235.2094 | 239.6068 | 238.3433 | 203.0014 | 111.5228 |
| 275.89212 | 431.7226 | 542.3784 | 549.3476 | 472.7911 | 394.2021 | 338.9863 | 254.0959 | 164.6301 | 67.0274 |
| 171.47927 | 302.4083 | 428.9764 | 542.8919 | 646.5695 | 679.3775 | 633.5432 | 530.5714 | 347.3910 | 149.9687 |
| 362.70761 | 583.1298 | 739.1073 | 815.1869 | 784.6263 | 689.3010 | 566.5087 | 413.2837 | 266.8270 | 109.6263 |
| 355.95481 | 572.0663 | 756.5471 | 896.4277 | 969.5854 | 957.8625 | 856.8865 | 690.5565 | 459.9856 | 198.4452 |
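The clustering and the relabeling by total intensity could look roughly like the sketch below. It uses `kmeans()` from base R on made-up data; the seed, the `nstart` value and all variable names are assumptions, not the settings used for the table above.

```r
set.seed(42)  # k-means starts from random centers

# Toy daily vectors (hours 8..17): one bell-like shape at 5 made-up intensity levels.
shape     <- c(0.3, 0.5, 0.75, 0.9, 1, 1, 0.85, 0.65, 0.4, 0.15)
intensity <- rep(c(200, 400, 600, 800, 1000), each = 20)
x <- outer(intensity, shape) + matrix(rnorm(100 * 10, sd = 10), nrow = 100)
colnames(x) <- as.character(8:17)

fit <- kmeans(x, centers = 5, nstart = 20)

# kmeans() numbers clusters arbitrarily, so relabel them:
# id 1 = lowest total intensity, id 5 = highest.
totals    <- rowSums(fit$centers)
new_id    <- rank(totals)                 # position = old id, value = new id
profile   <- unname(new_id[fit$cluster])  # relabeled cluster id of each day
centroids <- fit$centers[order(totals), ] # row i is the centroid with id i
```

Without the relabeling step, the cluster ids would change from run to run, and the profile numbers in the later tables would not be comparable.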
For precipitation, we observed almost no variation in the values throughout the year when compared to SOLR.
In fact, performing k-means over precipitation only yields lines as centroids. Therefore, we decided to perform binning instead. To do so, we first identify the minimum and maximum of the daily means across all precipitation vectors in the training set. Next, we divide this range into 5 equal parts to obtain the bin intervals. Finally, we assign each daily vector to its associated bin.
In order from 1 to 5, the centroids are the means of the following intervals:
| interval_labels | Profile |
|---|---|
| (-0.101,20.2] | 1 |
| (20.2,40.4] | 2 |
| (40.4,60.6] | 3 |
| (60.6,80.8] | 4 |
| (80.8,101] | 5 |
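The binning can be sketched with `cut()` from base R, which splits the observed range into equal-width intervals when `breaks` is a single number. The interval labels in the table above resemble `cut()` output, but that this is how they were produced is an assumption; the data and variable names below are made up.

```r
# Toy daily precipitation vectors (hours 8..17); values are made up.
prec <- rbind(rep(0, 10), rep(18.94, 10), rep(45, 10), rep(70, 10), rep(101, 10))

# Each day is summarized by its mean over the kept hours.
daily_mean <- rowMeans(prec)

# Five equal-width bins over the range of the daily means.
bins    <- cut(daily_mean, breaks = 5)
profile <- as.integer(bins)  # bin id 1..5 for each day

levels(bins)  # interval labels
```

To discretize days outside the training set consistently, the same break points would have to be reused instead of being recomputed from the new data.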
To create the probability model with multiple weather variables, we first decide how many consecutive days to associate with each variable.
Previously, we used only solar radiation to forecast Daym0, i.e. P(Daym0|Daym1,Daym2,…,Daymn). For P(Daym0|Daym1), the following consecutive-days table suffices to derive the joint probability, the conditional probability and, lastly, the argmax model:
| Daym1 | Daym0 |
|---|---|
| 4 | 3 |
| 3 | 3 |
| 3 | 2 |
| 2 | 2 |
| 2 | 4 |
| 4 | 2 |
To include 1 consecutive day of precipitation, we start from its table of consecutive days:
| Daym1 | Daym0 |
|---|---|
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
We must be careful to use Daym1 instead of Daym0 for precipitation; otherwise the model would use signal from the day being forecast. Putting both consecutive-days tables together (Daym1Prec, Daym1Solar, Daym0Solar), we obtain:
| precDaym1 | Daym1 | Daym0 |
|---|---|---|
| 1 | 4 | 3 |
| 1 | 3 | 3 |
| 1 | 3 | 2 |
| 1 | 2 | 2 |
| 1 | 2 | 4 |
| 1 | 4 | 2 |
Note that before the tables of consecutive days are created, each vector table is filled with “dummy vectors” on the days with missing data. This ensures that the affected rows of the consecutive-days tables are marked with “NA” and are therefore filtered out before the tables are turned into probability distributions. Also, in this particular case, since we use windows of size 2 for both weather variables, we do not need to align their time frames (this is a TO-DO for different joint sizes).
With the consecutive-days table, we can calculate the histogram for multiple weather variables by counting how many times each row of the table repeats:
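The construction of the combined consecutive-days table, including the padding of missing days, can be sketched as follows. The profile values and all names (`solar`, `prec`, `calendar`, `full`, `windows`) are hypothetical; only the column names of the result follow the table above.

```r
# Toy daily profiles; 2003-01-04 is absent from the data (values are made up).
dates <- as.Date("2003-01-01") + c(0, 1, 2, 4, 5)
solar <- data.frame(date = dates, solar = c(4, 3, 3, 2, 4))
prec  <- data.frame(date = dates, prec  = c(1, 1, 2, 1, 1))

# Pad missing calendar days with NA rows ("dummy" days), so that shifting
# by one row always pairs a day with its true predecessor.
calendar <- data.frame(date = seq(min(dates), max(dates), by = "day"))
full <- merge(merge(calendar, solar, all.x = TRUE), prec, all.x = TRUE)
full <- full[order(full$date), ]

# Row i pairs day i (Daym1) with day i + 1 (Daym0).
n <- nrow(full)
windows <- data.frame(
  precDaym1 = full$prec[-n],
  Daym1     = full$solar[-n],
  Daym0     = full$solar[-1]
)

# Windows that touch a dummy day contain NA and can be filtered out.
complete <- windows[complete.cases(windows), ]
```

Without the padding, the row for 2003-01-05 would follow 2003-01-03 directly, and the shift would silently treat them as consecutive days.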
| precDaym1 | Daym1 | Daym0 | Freq |
|---|---|---|---|
| 1 | 1 | 1 | 96 |
| 1 | 1 | 2 | 24 |
| 1 | 1 | 3 | 8 |
| 1 | 1 | 4 | 66 |
| 1 | 1 | 5 | 37 |
| 1 | 1 | NA | 11 |
By keeping only the complete cases of the histogram, we ensure the model does not skip days:
| | precDaym1 | Daym1 | Daym0 | Freq |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 96 |
| 2 | 1 | 1 | 2 | 24 |
| 3 | 1 | 1 | 3 | 8 |
| 4 | 1 | 1 | 4 | 66 |
| 5 | 1 | 1 | 5 | 37 |
| 7 | 1 | 2 | 1 | 24 |
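Counting the repeated rows and keeping only the complete cases can be sketched with `table()` and `complete.cases()`. The toy `windows` table is made up, and whether the report builds its histogram this way is an assumption.

```r
# Toy consecutive-days table with two incomplete windows (values are made up).
windows <- data.frame(
  precDaym1 = c(1, 1, 1, 1, 1, 1),
  Daym1     = c(4, 3, 3, 3, 2, NA),
  Daym0     = c(3, 3, 3, 2, NA, 4)
)

# Count every combination; useNA = "ifany" keeps NA as its own category
# so that incomplete windows show up in the counts.
histogram <- as.data.frame(table(windows, useNA = "ifany"))

# Drop the combinations that involve a missing day.
histogram <- histogram[complete.cases(histogram), ]
```

Because rows are removed after counting, the row names of the result keep gaps where the NA combinations used to be.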
Lastly, by normalizing the frequencies we obtain the distribution:
| | precDaym1 | Daym1 | Daym0 | Freq | Prob |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 96 | 0.0356877 |
| 2 | 1 | 1 | 2 | 24 | 0.0089219 |
| 3 | 1 | 1 | 3 | 8 | 0.0029740 |
| 4 | 1 | 1 | 4 | 66 | 0.0245353 |
| 5 | 1 | 1 | 5 | 37 | 0.0137546 |
| 7 | 1 | 2 | 1 | 24 | 0.0089219 |
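The normalization amounts to dividing each frequency by the total count. A minimal sketch on a made-up three-row histogram (so the probabilities differ from those in the table above):

```r
# Toy histogram (frequencies are made up for illustration).
histogram <- data.frame(
  precDaym1 = 1,
  Daym1     = c(1, 1, 2),
  Daym0     = c(1, 2, 1),
  Freq      = c(96, 24, 24)
)

# Joint probability of each (precDaym1, Daym1, Daym0) combination.
histogram$Prob <- histogram$Freq / sum(histogram$Freq)
```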
To complete the model, we need to calculate P(Daym0Solar | Daym1Solar,Daym1Prec) and then apply argmax.
```r
model <- model.argmax(histogram)
```
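The internals of `model.argmax` are not shown in this report. The sketch below is one possible way to go from a joint distribution to an argmax forecast; the toy frequencies and the names `joint`, `argmax_model` and `predict_day` are hypothetical.

```r
# Toy joint histogram (frequencies are made up for illustration).
joint <- data.frame(
  precDaym1 = c(1, 1, 1, 1, 2, 2),
  Daym1     = c(1, 1, 2, 2, 1, 1),
  Daym0     = c(1, 4, 2, 3, 1, 5),
  Freq      = c(96, 66, 30, 10, 5, 9)
)
joint$Prob <- joint$Freq / sum(joint$Freq)

# P(Daym0 | precDaym1, Daym1) = P(precDaym1, Daym1, Daym0) / P(precDaym1, Daym1)
marginal   <- ave(joint$Prob, joint$precDaym1, joint$Daym1, FUN = sum)
joint$Cond <- joint$Prob / marginal

# argmax: per condition, keep the Daym0 with the highest conditional probability.
best         <- ave(joint$Cond, joint$precDaym1, joint$Daym1, FUN = max)
argmax_model <- joint[joint$Cond == best, c("precDaym1", "Daym1", "Daym0", "Cond")]

# Hypothetical lookup helper: returns NA for a condition never seen in training.
predict_day <- function(model, prec_m1, solar_m1) {
  hit <- model$Daym0[model$precDaym1 == prec_m1 & model$Daym1 == solar_m1]
  if (length(hit) == 0) NA else hit[1]
}

predict_day(argmax_model, prec_m1 = 1, solar_m1 = 1)
```

In this sketch, ties are resolved by taking the first matching row, and unseen conditions yield `NA`; a real model would need an explicit policy for both cases.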
Having defined the model using data from 2003 to 2013, we evaluate it on the 2014 data. In this case, only the first 2 days are not forecast (the total window size is 3, hence the lack of forecast).
## [1] 1680.946
We conclude that, despite the higher mutual information observed between past precipitation and solar radiation than between past solar radiation and solar radiation itself, the model that combines both sources of information does not yield better results.