---
title: "WPD OpenData Challenge 1"
output: 
  html_notebook: 
    code_folding: hide
    fig_width: 2
    fig_height: 2
---

The WPD/Codalab/Catapult Challenge 1 was directed at estimating "the highest peak value and lowest trough at a one minute resolution within each half hourly period given only half hourly measurements."

Looking at the three sites identified, some additional WPD open data could assist here: the `external_dist_subs-3.csv` file from connecteddata.westernpower.co.uk.


```{r fig.height=2, fig.width=2}
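# Packages used throughout this notebook
library(data.table)  # fread()
library(dplyr)
library(ggplot2)
library(ggthemes)    # theme_economist()
library(scales)      # trans_format(), math_format(), breaks_log()

# Download the WPD distribution substation data (comment out once `wps` is loaded)
wps <- fread("https://connecteddata.westernpower.co.uk/dataset/29d435c2-0cbe-442d-96fe-a229a0307fba/resource/1aac8af9-813b-45f8-8d70-9622d696d76b/download/external_dist_subs-3.csv")

# Total the Solar column for the three challenge sites
wpss <- wps %>%
  select(`Primary Name`, Solar) %>%
  filter(`Primary Name` %in% c("Staplegrove", "Mousehole", "Geevor")) %>%
  group_by(`Primary Name`) %>%
  summarise(solar = sum(Solar))

ggplot(wpss, aes(x=`Primary Name`, y=solar)) +
  geom_col() +
  theme_economist()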
```

It isn't clear whether these totals are a count of solar installations or the registered kW capacity.
Either way, there appears to be more solar at Staplegrove.

Next, look at the variability in the 1-min dataset at Staplegrove. If solarPV is an issue, it should show up when the data are split into daylight and night-time series. The suncalc package provides the sun altitude for any latitude and longitude; night-time corresponds to negative solar altitudes.
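
A minimal sketch of that day/night split, under stated assumptions: the data frame `demo_1min`, its `time` column and the helper `is_night()` are hypothetical names, and the coordinates are rough placeholders for the Taunton area rather than values taken from the WPD data. `suncalc::getSunlightPosition()` reports altitude in radians, so only its sign matters here.

```{r}
# Sketch only: `demo_1min`, `time` and the coordinates are assumptions
site_lat <- 51.03   # approximate latitude (assumed)
site_lon <- -3.12   # approximate longitude (assumed)

# Toy series: one timestamp per minute for a single day, in UTC
demo_1min <- data.frame(
  time = seq(as.POSIXct("2020-06-21 00:00:00", tz = "UTC"),
             by = "1 min", length.out = 24 * 60)
)

# Night-time is defined as a negative sun altitude
is_night <- function(altitude) altitude < 0

# Only call suncalc if it is installed
if (requireNamespace("suncalc", quietly = TRUE)) {
  pos <- suncalc::getSunlightPosition(date = demo_1min$time,
                                      lat = site_lat, lon = site_lon,
                                      keep = "altitude")
  demo_1min$night <- is_night(pos$altitude)
  demo_night <- demo_1min[demo_1min$night, ]    # night-time series
  demo_day   <- demo_1min[!demo_1min$night, ]   # daylight series
}
```

Timestamps should be in UTC (or carry an explicit time zone) so that the computed altitude lines up with the meter readings.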

Boxplots of each hour's values for the night-time series:
```{r warning=FALSE}
ggplot(data=sm1n) +
  geom_boxplot(aes(x=hh, y=hll, group=hh)) +
  labs(
    title="Staplegrove: (value-max - value-min) by hour of day\nNight-time hours only (1min data)\nRed lines at 0.1, 0.05, i.e. 10%, 5%",
    x="Hour of day",
    y="(Maxvalue - Minvalue) / Value Note: Log scale") +
  geom_hline(yintercept=c(0.1,0.05),
             color="red") +
  scale_y_log10(labels =trans_format('log10', math_format(10^.x)),
                breaks=breaks_log(n=4)) +
  scale_x_continuous(breaks = c(0,3, 6,9,12,15,18,21,24)) +
  theme_economist()
```

The variability in night-time values is, except for a few outliers, below 10%, and even 5% covers the variability of most values.
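
The 10% and 5% statements can also be checked numerically rather than read off the plot. A small sketch, assuming `hll` holds the relative range and `hh` the hour of day as in the plot above; the data frame `demo_rel` and the helper `share_below()` are made up purely for illustration.

```{r}
# Toy stand-in for the night-time table; the values are invented
demo_rel <- data.frame(
  hh  = c(0, 0, 1, 1, 2, 2, 3, 3),
  hll = c(0.01, 0.03, 0.02, 0.06, 0.04, 0.12, 0.02, 0.03)
)

# Share of observations whose relative range stays under each threshold
share_below <- function(x, thresholds = c(0.05, 0.10)) {
  vapply(thresholds, function(th) mean(x < th, na.rm = TRUE), numeric(1))
}

overall <- share_below(demo_rel$hll)                    # whole series
by_hour <- aggregate(hll ~ hh, data = demo_rel,
                     FUN = function(x) mean(x < 0.05))  # per hour, 5% line
```

Applied to the real night-time table, `overall` would give the fraction of readings under 5% and 10%, and `by_hour` would show which hours fall short of the 5% line.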

Graphing the same measurements for daylight hours:
```{r warning=FALSE}
ggplot(data=sm1d) +
  geom_boxplot(aes(x=hh, y=hll, group=hh)) +
  labs(
    title="Staplegrove: (value-max - value-min) by hour of day\nDaylight hours only (1min data)\nRed lines at 0.1, 0.05, i.e. 10%, 5%",
    x="Hour of day",
    y="(Maxvalue - Minvalue) / Value Note: Log scale") +
  geom_hline(yintercept=c(0.1,0.05),
             color="red") +
  scale_y_log10(labels =trans_format('log10', math_format(10^.x)),
                breaks=breaks_log(n=4)) +
  scale_x_continuous(breaks = c(0,3, 6,9,12,15,18,21,24)) +
  theme_economist()
```
The daytime variability exceeds 10% more frequently, particularly during 'peak solar' times.

Looking at the variability in max-min values (30 min data) at Staplegrove in the same way:
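
For reference, a sketch of how the plotted columns might be derived. The definitions are inferred from the axis labels (`hl` as max minus min, `hll` as that range divided by the half-hourly value); the input column names `time`, `value`, `value_max` and `value_min` and the toy numbers are assumptions, not taken from the original notebook.

```{r}
# Toy half-hourly records; column names and numbers are assumptions
demo_hh <- data.frame(
  time      = as.POSIXct("2020-06-21 12:00:00", tz = "UTC") + 1800 * 0:3,
  value     = c(10.0, 12.0, 11.0, 9.0),
  value_max = c(10.5, 13.0, 11.2, 9.9),
  value_min = c( 9.5, 11.0, 10.9, 8.1)
)

demo_hh$hh  <- as.integer(format(demo_hh$time, "%H"))  # hour of day, 0-23
demo_hh$hl  <- demo_hh$value_max - demo_hh$value_min   # absolute range
# abs() keeps the ratio positive for a log scale if values can go negative
demo_hh$hll <- demo_hh$hl / abs(demo_hh$value)         # relative range
```

Note that the 0.1 and 0.05 reference lines only read as 10% and 5% for the relative measure `hll`; for the absolute range `hl` they are in the units of the data.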

```{r warning=FALSE}
ggplot(data=san) +
  geom_boxplot(aes(x=hh, y=hl, group=hh)) +
  labs(
    title="Staplegrove: (value-max - value-min) by hour of day\nNight-time hours only (30min data)\nRed lines at 0.1, 0.05, i.e. 10%, 5%",
    x="Hour of day",
    y="Difference: Maxvalue - Minvalue Note: Log scale") +
  geom_hline(yintercept=c(0.1,0.05),
             color="red") +
  scale_y_log10(labels =trans_format('log10', math_format(10^.x)),
                breaks=breaks_log(n=4)) +
  scale_x_continuous(breaks = c(0,3, 6,9,12,15,18,21,24)) +
  theme_economist()
```
While for daylight hours:
```{r}
ggplot(data=sad) +
  geom_boxplot(aes(x=hh, y=hl, group=hh)) +
  labs(
    title="Staplegrove: (value-max - value-min) by hour of day\nDaylight hours only (30min data)\nRed lines at 0.1, 0.05, i.e. 10%, 5%",
    x="Hour of day",
    y="Difference: Maxvalue - Minvalue Note: Log scale") +
  geom_hline(yintercept=c(0.1,0.05),
             color="red") +
  scale_y_log10(labels =trans_format('log10', math_format(10^.x)),
                breaks=breaks_log(n=4)) +
  scale_x_continuous(breaks = c(0,3, 6,9,12,15,18,21,24)) +
  theme_economist()
```

SolarPV does appear to contribute significantly to the variability in the data:
```{r}
# Label the two boxes explicitly instead of the default 1 and 2
boxplot(sad$hl, san$hl,
        names=c("Day", "Night"),
        main="Staplegrove 30 min data\nCompare day and night",
        xlab="Day vs Night",
        ylab="Range of variation")
```
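
To put a number on the day/night contrast, the two distributions can be summarised by their median and interquartile range. This is only a sketch: `demo_day_hl` and `demo_night_hl` are invented values standing in for `sad$hl` and `san$hl`, and the summary itself is an addition rather than part of the original analysis.

```{r}
# Invented ranges standing in for sad$hl (day) and san$hl (night)
demo_day_hl   <- c(0.8, 1.2, 1.5, 2.0, 2.6)
demo_night_hl <- c(0.3, 0.4, 0.5, 0.6, 0.7)

demo_summary <- data.frame(
  period = c("day", "night"),
  median = c(median(demo_day_hl), median(demo_night_hl)),
  iqr    = c(IQR(demo_day_hl), IQR(demo_night_hl))
)

# Ratio of the daytime median range to the night-time median range
demo_ratio <- demo_summary$median[1] / demo_summary$median[2]
```
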
In comparison, Mousehole has less solarPV than Staplegrove. For night-time hours:

```{r warning=FALSE}
ggplot(data=man) +
  geom_boxplot(aes(x=hh, y=hll, group=hh)) +
  labs(
    title="Mousehole: (value-max - value-min) by hour of day\nNight-time hours only (30min data)\nRed lines at 0.1, 0.05, i.e. 10%, 5%",
    x="Hour of day",
    y="(Maxvalue - Minvalue)/value Note: Log scale") +
  geom_hline(yintercept=c(0.1,0.05),
             color="red") +
  scale_y_log10(labels =trans_format('log10', math_format(10^.x)),
                breaks=breaks_log(n=4)) +
  scale_x_continuous(breaks = c(0,3, 6,9,12,15,18,21,24)) +
  theme_economist()
```
While for daylight hours:
```{r warning=FALSE}
ggplot(data=mad) +
  geom_boxplot(aes(x=hh, y=hll, group=hh)) +
  labs(
    title="Mousehole: (value-max - value-min) by hour of day\nDaylight hours only (30min data)\nRed lines at 0.1, 0.05, i.e. 10%, 5%",
    x="Hour of day",
    y="(Maxvalue - Minvalue)/value Note: Log scale") +
  geom_hline(yintercept=c(0.1,0.05),
             color="red") +
  scale_y_log10(labels =trans_format('log10', math_format(10^.x)),
                breaks=breaks_log(n=4)) +
  scale_x_continuous(breaks = c(0,3, 6,9,12,15,18,21,24)) +
  theme_economist()
```

Summary: the variability in max-min values is affected by the amount of solarPV connected to the Primary substation's network.
Boxplot of absolute range (value-max - value-min) for Staplegrove and Mousehole:
```{r}
# Label the two boxes by site instead of the default 1 and 2
boxplot(sad$hl, mad$hl,
        names=c("Staplegrove", "Mousehole"),
        main="Staplegrove/Mousehole 30 min data\nDaylight only",
        xlab="Staplegrove vs Mousehole",
        ylab="Range of variation")
```

While at night:
```{r}
# Label the two boxes by site instead of the default 1 and 2
boxplot(san$hl, man$hl,
        names=c("Staplegrove", "Mousehole"),
        main="Staplegrove/Mousehole 30 min data\nNight-time only",
        xlab="Staplegrove vs Mousehole",
        ylab="Range of variation")
```

