Objective

The purpose of this report is to investigate how precipitation (and, later, other weather variables) can improve solar forecasting with our argmax probability model. The data in this report come from Schofield Barracks (SCBH1).

At SCBH1, solar radiation is sampled every hour, and the records range from 2003 to 2014. The first rows of the raw data are:

MON DAY YEAR HR MIN TMZN TMPF RELH SKNT GUST DRCT QFLG SOLR TLKE PREC SINT FT FM PEAK HI24 LO24 PDIR VOLT DWPF
1 1 2003 0 55 HST 57 92 5 5 240 0 0 NA 18.94 NA 57 20 NA NA NA NA NA 54.7
1 1 2003 1 55 HST 59 87 3 6 240 0 0 NA 18.94 NA 55 21 NA NA NA NA NA 55.1
1 1 2003 2 55 HST 58 92 3 3 230 0 0 NA 18.94 NA 55 22 NA NA NA NA NA 55.7
1 1 2003 3 55 HST 57 92 3 5 250 0 0 NA 18.94 NA 56 23 NA NA NA NA NA 54.7
1 1 2003 4 55 HST 58 92 2 4 230 0 0 NA 18.94 NA 56 24 NA NA NA NA NA 55.7
1 1 2003 5 55 HST 57 94 3 5 240 0 0 NA 18.94 NA 54 26 NA NA NA NA NA 55.3

Pre-Processing

Missing Data-Points

Currently, we remove any daily vector that contains a missing value between hours 8 and 17. Furthermore, the probability model discards any window that contains a missing day.

Vectorization

We vectorize each weather variable of interest (currently solar and precipitation) in the dataset into daily numeric vectors. Each row of the resulting table is one day, and its 24 columns hold the solar radiation at hours 0, 1, …, 23. For instance, the first few rows of solar radiation are:

YEAR MON DAY 8 9 10 11 12 13 14 15 16 17
2003 1 1 322 507 746 790 770 874 645 402 94 7
2003 1 2 131 236 197 225 213 179 145 57 27 5
2003 1 3 314 166 288 470 323 194 161 423 46 7
2003 1 4 64 130 808 793 807 752 150 68 20 3
2003 1 5 53 644 691 821 244 269 236 404 54 10
2003 1 6 226 730 970 500 785 402 632 403 199 10

Since the early and late hours of the day contain little to no solar radiation, we keep only the hours from 8 to 17 for all weather variables of interest. For instance, for solar:

YEAR MON DAY 8 9 10 11 12 13 14 15 16 17
2003 1 1 322 507 746 790 770 874 645 402 94 7
2003 1 2 131 236 197 225 213 179 145 57 27 5
2003 1 3 314 166 288 470 323 194 161 423 46 7
2003 1 4 64 130 808 793 807 752 150 68 20 3
2003 1 5 53 644 691 821 244 269 236 404 54 10
2003 1 6 226 730 970 500 785 402 632 403 199 10
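The vectorization and missing-value filtering described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper name vectorize_daily and the toy hourly values are hypothetical, and the column names YEAR, MON, DAY, HR and SOLR are taken from the raw header shown earlier.

```r
# Toy hourly records in long format (two days, 24 hours each); values are illustrative.
hourly <- data.frame(
  YEAR = 2003, MON = 1,
  DAY  = rep(1:2, each = 24),
  HR   = rep(0:23, times = 2),
  SOLR = c(rep(0, 8), 322, 507, 746, 790, 770, 874, 645, 402, 94, 7, rep(0, 6),
           rep(0, 8), 131, 236, NA, 225, 213, 179, 145, 57, 27, 5, rep(0, 6))
)

# vectorize_daily() is a hypothetical helper, not a function from the report.
vectorize_daily <- function(df, variable, hours = 8:17) {
  # Keep only the daytime hours of interest.
  df <- df[df$HR %in% hours, c("YEAR", "MON", "DAY", "HR", variable)]
  # One row per day, one column per hour.
  wide <- reshape(df, idvar = c("YEAR", "MON", "DAY"), timevar = "HR",
                  v.names = variable, direction = "wide")
  # Rename "SOLR.8" to "8", and so on.
  names(wide) <- sub(paste0("^", variable, "\\."), "", names(wide))
  # Drop days with any missing value in the kept hours.
  wide[complete.cases(wide), ]
}

solar_vectors <- vectorize_daily(hourly, "SOLR")
```

In this toy input the second day has a missing value at hour 10, so only the first day survives the filter.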

Discretization

To calculate the probability distributions, we first need to discretize each weather variable. The discretization method depends on the weather variable.

Solar - 5 Means Clustering

For the solar radiation vectors, we apply 5-means clustering. We chose 5 clusters because this created a clear separation between the daily solar radiation vectors over the years in the dataset.

The ids of the centroids (1 to 5) are ordered by the total solar intensity of each centroid (5 is highest, 1 is lowest). The rows below list centroids 1 to 5 in that order:

8 9 10 11 12 13 14 15 16 17
96.45584 145.7208 184.0028 208.9416 225.6738 235.2094 239.6068 238.3433 203.0014 111.5228
275.89212 431.7226 542.3784 549.3476 472.7911 394.2021 338.9863 254.0959 164.6301 67.0274
171.47927 302.4083 428.9764 542.8919 646.5695 679.3775 633.5432 530.5714 347.3910 149.9687
362.70761 583.1298 739.1073 815.1869 784.6263 689.3010 566.5087 413.2837 266.8270 109.6263
355.95481 572.0663 756.5471 896.4277 969.5854 957.8625 856.8865 690.5565 459.9856 198.4452
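A minimal sketch of this clustering step, using the kmeans function from base R's stats package, is shown below. The input matrix is synthetic (it is not SCBH1 data), and the relabeling by total intensity is one possible way to obtain the ordering described above, not necessarily the one used for this report.

```r
set.seed(1)

# Synthetic daily vectors: 100 days x 10 hours (8..17), five intensity levels plus noise.
shape <- sin(seq(0.3, 2.8, length.out = 10))
day_level <- rep(c(200, 400, 600, 800, 1000), each = 20)
solar <- outer(day_level, shape) + matrix(rnorm(100 * 10, sd = 10), nrow = 100)
colnames(solar) <- 8:17

# 5-means clustering; several random starts reduce the risk of a poor local optimum.
fit <- kmeans(solar, centers = 5, nstart = 20)

# Relabel so that id 1 has the lowest total intensity and id 5 the highest.
totals <- rowSums(fit$centers)
centroids <- fit$centers[order(totals), ]
rownames(centroids) <- 1:5
profile <- as.integer(rank(totals)[fit$cluster])
```

Here profile holds the discretized value (1 to 5) of each day, and centroids is ordered the same way as the table above.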

Precipitation - Binning

For precipitation, we observed almost no variation in the values throughout the year when compared to SOLR.

In fact, performing k-means over precipitation only results in lines. Therefore, we decided to perform binning instead. To do so, we first compute the mean of each daily precipitation vector and identify the minimum and maximum of these means over the training set. Next, we divide that range into 5 equal-width intervals to obtain the bins. Finally, we assign each daily vector to its associated bin. This yields the following bins:

In order from 1 to 5, the centroids are the means of the following intervals:

interval_labels Profile
(-0.101,20.2] 1
(20.2,40.4] 2
(40.4,60.6] 3
(60.6,80.8] 4
(80.8,101] 5
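The binning step can be sketched with base R's cut, which splits the observed range into a given number of equal-width intervals. The precipitation matrix below is a toy example chosen so that the daily means range from 0 to 101; it is an assumption for illustration, not the training set.

```r
# Toy precipitation vectors (rows = days, columns = hours 8..17); values are illustrative.
prec <- rbind(rep(0, 10), rep(18.94, 10), rep(55, 10), rep(101, 10))

# One mean per daily vector.
daily_mean <- rowMeans(prec)

# Five equal-width bins between the minimum and maximum daily mean.
# cut() widens the outer limits slightly so that both extremes fall inside a bin.
bins <- cut(daily_mean, breaks = 5)

# The bin index (1 to 5) is the discretized profile of each day.
profile <- as.integer(bins)
```

With daily means spanning 0 to 101, levels(bins) reproduces the interval labels listed above.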

Probability Model

Weather Variable Consecutive Days

To create the probability model with multiple weather variables, we first decide how many consecutive days to associate with each of them.

Previously, we used only solar radiation to forecast Daym0 (i.e. P(Daym0|Daym1,Daym2,…,Daymn)). For P(Daym0|Daym1), the following consecutive days table would suffice to derive the joint probability, the conditional probability, and lastly the argmax model:

Daym1 Daym0
4 3
3 3
3 2
2 2
2 4
4 2

To include one consecutive day of precipitation, we start from its table of consecutive days:

Daym1 Daym0
1 1
1 1
1 1
1 1
1 1
1 1

We must be careful to use Daym1 instead of Daym0; otherwise, the model would be using signal from the same day it is forecasting. Putting together both tables of consecutive days (Daym1Prec, Daym1Solar, Daym0Solar), we therefore have:

precDaym1 Daym1 Daym0
1 4 3
1 3 3
1 3 2
1 2 2
1 2 4
1 4 2

Note that before creating the tables of consecutive days, each vector table is filled with “dummy vectors” on the days with missing data. This ensures that rows of the consecutive days tables that touch a missing day are marked with “NA” and are therefore filtered out before being turned into probability distributions. Also, in this particular case, since we use windows of size 2 for both weather variables, we do not need to align their timeframes (this is a TO-DO for different joint sizes).
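The construction of the combined consecutive days table can be sketched as follows. The helper consecutive_days is hypothetical, and the profile vectors are toy values in which NA stands in for a dummy (missing) day; both variables use a window of size 2, as in the text.

```r
# Daily profiles in chronological order; NA marks a dummy (missing) day.
solar_profile <- c(4, 3, 3, 2, NA, 4, 2)
prec_profile  <- c(1, 1, 1, 1, 1, 1, 1)

# consecutive_days() is a hypothetical helper: it returns one row per window of
# n consecutive days, with columns ordered from the oldest day to Daym0.
consecutive_days <- function(profile, n) {
  out <- embed(profile, n)[, n:1, drop = FALSE]
  colnames(out) <- paste0("Daym", (n - 1):0)
  as.data.frame(out)
}

solar_cd <- consecutive_days(solar_profile, 2)
prec_cd  <- consecutive_days(prec_profile, 2)

# Use precipitation from Daym1 only, so no same-day signal leaks into the forecast.
joint <- data.frame(precDaym1 = prec_cd$Daym1,
                    Daym1     = solar_cd$Daym1,
                    Daym0     = solar_cd$Daym0)

# Windows that touch a missing day contain NA and are filtered out.
joint_complete <- joint[complete.cases(joint), ]
```

Because both tables are built from windows of the same size over the same days, their rows line up without any further alignment.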

Histogram and Distribution

With the consecutive days table, we can calculate the histogram for multiple weather variables by counting how many times each row of the consecutive days table repeats:

precDaym1 Daym1 Daym0 Freq
1 1 1 96
1 1 2 24
1 1 3 8
1 1 4 66
1 1 5 37
1 1 NA 11

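By keeping only the complete cases of the histogram, we ensure the model does not skip days. In the table below, the unlabeled leading column is the original row index; row 6, which contained NA, has been dropped: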

precDaym1 Daym1 Daym0 Freq
1 1 1 1 96
2 1 1 2 24
3 1 1 3 8
4 1 1 4 66
5 1 1 5 37
7 1 2 1 24

Lastly, by normalizing the frequencies we obtain the distribution:

precDaym1 Daym1 Daym0 Freq Prob
1 1 1 1 96 0.0356877
2 1 1 2 24 0.0089219
3 1 1 3 8 0.0029740
4 1 1 4 66 0.0245353
5 1 1 5 37 0.0137546
7 1 2 1 24 0.0089219
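The counting and normalization steps can be sketched in base R as below. The joint table is a toy example, and aggregate is one possible way to count repeated rows; the report's own code may build the histogram differently.

```r
# Toy consecutive days table; NA marks windows that touch a missing day.
joint <- data.frame(
  precDaym1 = c(1, 1, 1, 1, 1, 1, 1),
  Daym1     = c(4, 3, 3, 3, 2, 2, NA),
  Daym0     = c(3, 3, 3, 2, 2, NA, 4)
)

# Keep only complete windows.
complete <- joint[complete.cases(joint), ]

# Histogram: count how many times each (precDaym1, Daym1, Daym0) row occurs.
histogram <- aggregate(Freq ~ precDaym1 + Daym1 + Daym0,
                       data = cbind(complete, Freq = 1), FUN = sum)

# Joint distribution: normalize the frequencies so that they sum to 1.
histogram$Prob <- histogram$Freq / sum(histogram$Freq)
```

The same normalization applied to the table above gives, for example, 96 / 2690 = 0.0356877 for the first row, where 2690 is the total implied by the listed probabilities.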

Conditional Probability and Argmax Model

To complete the model, we need to calculate P(Daym0Solar | Daym1Solar,Daym1Prec) and then apply argmax.

model <- model.argmax(histogram)
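The internals of model.argmax are not shown in this report. As a rough sketch of what such a step could look like, the hypothetical function argmax_sketch below divides each frequency by the total of its conditioning group (precDaym1, Daym1) and keeps the most probable Daym0 per group. The first five rows reuse the frequencies listed above; the last two rows are made up to show a second group.

```r
histogram <- data.frame(
  precDaym1 = c(1, 1, 1, 1, 1, 1, 1),
  Daym1     = c(1, 1, 1, 1, 1, 2, 2),
  Daym0     = c(1, 2, 3, 4, 5, 1, 2),
  Freq      = c(96, 24, 8, 66, 37, 24, 30)  # last two values are made up
)

# argmax_sketch() is a hypothetical stand-in, not the report's model.argmax().
argmax_sketch <- function(h) {
  # One group per combination of conditioning variables.
  key <- interaction(h$precDaym1, h$Daym1, drop = TRUE)
  # P(Daym0 | precDaym1, Daym1) = Freq / total Freq of the group.
  h$CondProb <- h$Freq / ave(h$Freq, key, FUN = sum)
  # Keep the most probable Daym0 in each group.
  best <- do.call(rbind, lapply(split(h, key), function(g) g[which.max(g$CondProb), ]))
  rownames(best) <- NULL
  best[, c("precDaym1", "Daym1", "Daym0", "CondProb")]
}

model_sketch <- argmax_sketch(histogram)
```

The result is a lookup table: given yesterday's precipitation and solar profiles, it returns the forecast solar profile for today. Conditioning combinations never seen in training have no row, so a real model would need a fallback for them.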

Evaluation

Having defined the model using data from 2003 to 2013, we evaluate it using the 2014 data. In this case, only the first 2 days are not forecast, because the total window size is 3.

## [1] 1680.946

Conclusion

We conclude that, although past precipitation shows higher mutual information with solar radiation than past solar radiation does, the model that combines both sources of information does not yield better results.