MISC

# regarding 'readthedown' theme
# https://cran.r-project.org/web/packages/rmdformats/vignettes/introduction.html

Introduction

Check out this Kaggle

This data has been gathered at two solar power plants in India over a 34 day period. It has two pairs of files - each pair has one power generation dataset and one sensor readings dataset. The power generation datasets are gathered at the inverter level - each inverter has multiple lines of solar panels attached to it. The sensor data is gathered at a plant level - single array of sensors optimally placed at the plant.

There are a few areas of concern at the solar power plant -

  1. Can we predict the power generation for next couple of days? - this allows for better grid management
  2. Can we identify the need for panel cleaning/maintenance?
  3. Can we identify faulty or sub-optimally performing equipment?

Ideas addressing concerns

  1. ARIMA / PROPHET modeling
    • at the inverter level, and then rollup (aggregate sum) to power plant level
  2. inverter data set
    • anomaly detection, boxplots
  3. sensor data set
    • anomaly detection, boxplots

MISC Ideas

  1. If we knew the exact location of the powerplants, we could get local weather information
    • find correlation of weather (sunny/cloudy days) to daily yield to discover normal/abnormal inverters & sensors

DATA DICTIONARY for Power Generation Inverter data sets

  1. AC_POWER : Amount of AC power generated by the inverter (source_key) in this 15 minute interval. Units - kW.
  2. AC_ : Amount of DC power generated by the inverter (source_key) in this 15 minute interval. Units - kW.
  3. DAILY_YIELD : Daily yield is a cumulative sum of power generated on that day, till that point in time.
  4. DATE_TIME : Date and time for each observation. Observations recorded at 15 minute intervals.
  5. PLANT_ID : Plant ID - this will be common for the entire file.
  6. SOURCE_KEY : Source key in this file stands for the inverter id.
  7. TOTAL_YIELD : This is the total yield for the inverter till that point in time.

Power Generation Inverter data set: Plant 1

Get and Clean Data

p1.gd = read_csv('Plant_1_Generation_Data.csv') %>%
  slice_sample(prop = 1) %>% #!!<NOTE>temp, working with a sample of datset for speed purposes
  clean_names() %>%  #lowercase
  select(sort(tidyselect::peek_vars())) %>% #sort cols alphabetically
  select(where(is.factor),where(is.character),where(is.numeric)) #sort cols by data type

#OlsonNames()
#https://stackoverflow.com/questions/41479008/what-is-the-correct-tz-database-time-zone-for-india

p1.gd = p1.gd %>% mutate(
  date_time = as.POSIXct(strptime(p1.gd$date_time, "%d-%m-%Y %H:%M"), tz = 'Asia/Kolkata'),
  source_key = factor(p1.gd$source_key)
) %>% rename(inverter = source_key)

p1.gd$plant_id = NULL

glimpse structure, and sample rows

p1.gd %>% slice_sample(n = 10) %>% DT::datatable()
p1.gd %>% glimpse()
## Rows: 68,778
## Columns: 6
## $ date_time   <dttm> 2020-05-28 05:15:00, 2020-05-24 17:15:00, 2020-05-16 0...
## $ inverter    <fct> 3PZuoBAID5Wc2HD, 3PZuoBAID5Wc2HD, 1BY6WEcLGh8j5v7, adLQ...
## $ ac_power    <dbl> 0.000000, 239.728571, 0.000000, 710.685714, 339.371429,...
## $ daily_yield <dbl> 0.00000, 8180.28571, 0.00000, 4575.14286, 205.71429, 47...
## $ dc_power    <dbl> 0.00000, 2444.57143, 0.00000, 7258.00000, 3454.00000, 5...
## $ total_yield <dbl> 7085451, 7061448, 6265313, 6487309, 7200435, 6226413, 7...

check missing values

p1.gd %>% miss_var_summary()
## # A tibble: 6 x 3
##   variable    n_miss pct_miss
##   <chr>        <int>    <dbl>
## 1 date_time        0        0
## 2 inverter         0        0
## 3 ac_power         0        0
## 4 daily_yield      0        0
## 5 dc_power         0        0
## 6 total_yield      0        0

EDA: Factor Vars

counts each factor’s unique levels

sapply(p1.gd %>% select(where(is.factor)), n_unique) %>% as.data.frame()
##           .
## inverter 22

reference: names of unique levels

sapply(p1.gd %>% select(where(is.factor)), unique) %>% as.data.frame() %>% arrange()
##           inverter
## 1  3PZuoBAID5Wc2HD
## 2  1BY6WEcLGh8j5v7
## 3  adLQvlD726eNBSB
## 4  YxYtjZvoooNbGkE
## 5  ih0vzX44oOqAx2f
## 6  McdE0feGgRqW7Ca
## 7  uHbuxQJl8lW7ozc
## 8  sjndEbLyjtCKgGv
## 9  pkci93gMrogZuBj
## 10 ZoEaEvLYb1n2sOq
## 11 VHMLBKoKgIrUVDU
## 12 zBIq5rxdHJRwDNY
## 13 bvBOhCH3iADSZry
## 14 ZnxXDlPa8U1GXgE
## 15 1IF53ai7Xc0U56Y
## 16 z9Y9gH1T5YWrNuG
## 17 wCURE6d3bPkepu2
## 18 iCRJl6heRkivqQ3
## 19 zVJPv84UY57bAof
## 20 WRmjgnKYAwPKWDb
## 21 rGa61gmuvPhdLxV
## 22 7JYdWkrLSPkdwr4

viz: distribution of level counts

jpal = colorRampPalette(brewer.pal(8,'Dark2'))(22)

p1.gd %>% count(inverter) %>% plot_ly(y = ~fct_reorder(inverter,n), x = ~n, color = ~inverter, colors = jpal) %>% add_bars(hoverinfo = 'text', text = ~n) %>% hide_legend() %>% layout(
    title = 'Source Key Counts',
    xaxis = list(title = ''),
    yaxis = list(title = '')
    ) 
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

Sanity Check - PASS - Pretty much a uniform distribution

EDA: Numeric Vars

viz bivariate numeric distribution

DataExplorer::plot_boxplot(p1.gd %>% select(where(is.numeric)), by = 'daily_yield')

viz: numeric univariate distributions

names.numeric = p1.gd %>% select(where(is.numeric)) %>% names

p1.gd %>% dlookr::plot_normality(
  names.numeric[1],
  names.numeric[2],
  names.numeric[3]
  )

The 0s for ac/dc power and daily_yield indicate the inverter was powered off for a ‘rest day’ / maintenance 

viz: numeric univariate distributions

DataExplorer::plot_density(p1.gd %>% select(where(is.numeric)))

Looking at daily yield, we can infer that that on a daily basis most inverters are either having cleaning/maintenance done, or are producing >= 6.5k units of energy. Looking at total yield, we can infer that for this entire period, most inverters produced around slightly under 6.5 million units of energy, or 7.25 million units of energy

viz: distributions by ‘inverter’ factor

p1.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = daily_yield, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~daily_yield, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of Daily Yield by Inverter')
p1.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = ac_power, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~ac_power, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of AC Power by Inverter')
p1.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = dc_power, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~dc_power, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of DC Power by Inverter')

viz: cleaning/maintenance days

p1.gd %>% arrange(date_time) %>% group_nest(inverter) %>% unnest

p1.gd.plots = p1.gd %>% arrange(date_time) %>% group_nest(inverter) %>% mutate(
  plots = map2(
    .x = data,
    .y = inverter,
    ~ggplot(data = .x, aes(date_time, daily_yield, color = daily_yield == 0, group = 1)) +
      geom_line(size = 1.2) + scale_color_manual(values = c('black','red')) + ggtitle(paste0('Inverter: ', .y))
  ))

p1.gd.plots$plots

correlations: viz

p1.gd %>% dlookr::plot_correlate()

since dc and ac power are just perfectly convertible (like fahrenheit and celsius), they have a perfect correlation

EDA: Time Series Viz

Anomoly Plot

library(anomalize)
# anomalize(data, target, method = c("iqr", "gesd"), alpha = 0.05, max_anoms = 0.2, verbose = FALSE)

# alpha: Controls the width of the "normal" range. Lower values are more conservative while higher values are less prone to incorrectly classifying "normal" observations.
# max_anoms: The maximum percent of anomalies permitted to be identified.

p1.gd.anomalize = p1.gd %>% arrange(date_time) %>% 
  mutate(inverter = fct_reorder(inverter, -daily_yield)) %>% 
  group_by(inverter) %>%
  time_decompose(daily_yield, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.05, method = 'gesd') %>%
  time_recompose()

  p1.gd.anomalize %>%
    plot_anomalies(
      ncol = 1,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 1.5,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'daily_yield', title = 'daily_yield')

Something abnormal clearly happened between around May 18th to the end of May. We should investigate the inverters that were affected.

Power Generation Inverter data set: Plant 2

Get and Clean Data

p2.gd = read_csv('Plant_2_Generation_Data.csv') %>%
  slice_sample(prop = 1) %>% #!!<NOTE>temp, working with a sample of datset for speed purposes
  clean_names() %>%  #lowercase
  select(sort(tidyselect::peek_vars())) %>% #sort cols alphabetically
  select(where(is.POSIXct), where(is.factor),where(is.character),where(is.numeric)) #sort cols by data type

#OlsonNames()
#https://stackoverflow.com/questions/41479008/what-is-the-correct-tz-database-time-zone-for-india

p2.gd = p2.gd %>% mutate(
  source_key = factor(p2.gd$source_key),
) %>% rename(inverter = source_key)

p2.gd$plant_id = NULL

glimpse structure, and sample rows

p2.gd %>% slice_sample(n = 10) %>% DT::datatable()
p2.gd %>% glimpse()
## Rows: 67,698
## Columns: 6
## $ date_time   <dttm> 2020-05-19 06:15:00, 2020-06-07 18:15:00, 2020-06-08 1...
## $ inverter    <fct> mqwcsP2rE7J0TFp, V94E5Ben1TlhnDV, vOuJvMaM2sgwLmb, 4UPU...
## $ ac_power    <dbl> 48.446667, 55.900000, 996.786667, 510.560000, 72.246667...
## $ daily_yield <dbl> 7.8000000, 1864.8000000, 7536.7333333, 470.7333333, 19....
## $ dc_power    <dbl> 50.080000, 57.753333, 1020.840000, 520.833333, 74.48000...
## $ total_yield <dbl> 593612056, 1412242458, 2374225, 2644516, 2397742, 59377...

check missing values

p2.gd %>% miss_var_summary()
## # A tibble: 6 x 3
##   variable    n_miss pct_miss
##   <chr>        <int>    <dbl>
## 1 date_time        0        0
## 2 inverter         0        0
## 3 ac_power         0        0
## 4 daily_yield      0        0
## 5 dc_power         0        0
## 6 total_yield      0        0

EDA: Factor Vars

counts each factor’s unique levels

sapply(p2.gd %>% select(where(is.factor)), n_unique) %>% as.data.frame()
##           .
## inverter 22

reference: names of unique levels

sapply(p2.gd %>% select(where(is.factor)), unique) %>% as.data.frame() %>% arrange()
##           inverter
## 1  mqwcsP2rE7J0TFp
## 2  V94E5Ben1TlhnDV
## 3  vOuJvMaM2sgwLmb
## 4  4UPUqMRk7TRMgml
## 5  rrq4fwE8jgrTyWY
## 6  LYwnQax7tkwH5Cb
## 7  IQ2d7wF4YD8zU1Q
## 8  81aHJ1q11NBPMrL
## 9  NgDl19wMapZy17u
## 10 Mx2yZCDsyf6DPfv
## 11 oZ35aAeoifZaQzV
## 12 Quc1TzYxW2pYoWX
## 13 Qf4GUc1pJu5T6c6
## 14 xoJJ8DcxJEcupym
## 15 oZZkBaNadn6DNKz
## 16 LlT2YUhhzqhg5Sw
## 17 xMbIugepa2P7lBB
## 18 Et9kgGMDl729KT4
## 19 9kRcWv60rDACzjR
## 20 WcxssY2VbP4hApt
## 21 PeE6FRyGXUgsRhN
## 22 q49J1IKaHRwDQnt

viz: distribution of level counts

jpal = colorRampPalette(brewer.pal(8,'Dark2'))(22)

p2.gd %>% count(inverter) %>% plot_ly(y = ~fct_reorder(inverter,n), x = ~n, color = ~inverter, colors = jpal) %>% add_bars(hoverinfo = 'text', text = ~n) %>% hide_legend() %>% layout(
    title = 'Source Key Counts',
    xaxis = list(title = ''),
    yaxis = list(title = '')
    ) 

Sanity Check - FAIL - We should find out why the bottom 4 inverters have fewer counts. Were they totally out of commission during the abnormal date range aforementioned?

EDA: Numeric Vars

viz bivariate numeric distribution

DataExplorer::plot_boxplot(p2.gd %>% select(where(is.numeric)), by = 'daily_yield')

viz: numeric univariate distributions

names.numeric = p2.gd %>% select(where(is.numeric)) %>% names

p2.gd %>% dlookr::plot_normality(
  names.numeric[1],
  names.numeric[2],
  names.numeric[3],
  names.numeric[4]
  )

viz: numeric univariate distributions

DataExplorer::plot_density(p2.gd %>% select(where(is.numeric)))

the inverters clearly have a wide distribution of daily_yield and total yield. Looking at daily yield, we can infer that that on a daily basis most inverters are either having cleaning/maintenance done, or producing around 3.7k, 7.5k, 9.0k units of energy Looking at total yield, we can infer that for this entire period, most inverters produced around slightly under 1.3 billion units of energy, or 1.8 billion units of energy. This is a lot more than Plant 1.

viz: distributions by ‘inverter’ factor

p2.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = daily_yield, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~daily_yield, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of Daily Yield by Inverter')
p2.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = ac_power, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~ac_power, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of AC Power by Inverter')
p2.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = dc_power, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~dc_power, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of DC Power by Inverter')

We should probably investigate why yellowish (Quc1TzYxW2pYoWX), pink (LYwnQax7tkwH5Cb) and purple (Et9kgGMDl729KT4) have so many outliers

viz: cleaning/maintenance days

p2.gd %>% arrange(date_time) %>% group_nest(inverter) %>% unnest

p2.gd.plots = p2.gd %>% arrange(date_time) %>% group_nest(inverter) %>% mutate(
  plots = map2(
    .x = data,
    .y = inverter,
    ~ggplot(data = .x, aes(date_time, daily_yield, color = daily_yield == 0, group = 1)) +
      geom_line(size = 1.2) + scale_color_manual(values = c('black','red')) + ggtitle(paste0('Inverter: ', .y))
  ))

p2.gd.plots$plots

correlations: viz

p2.gd %>% dlookr::plot_correlate()

Since dc and ac power are perfectly convertible (like fahrenheit and celsius), they have a perfect correlation

EDA: Time Series Viz

Anomoly Plot

library(anomalize)
# anomalize(data, target, method = c("iqr", "gesd"), alpha = 0.05, max_anoms = 0.2, verbose = FALSE)

# alpha: Controls the width of the "normal" range. Lower values are more conservative while higher values are less prone to incorrectly classifying "normal" observations.
# max_anoms: The maximum percent of anomalies permitted to be identified.

p2.gd.anomalize = p2.gd %>% arrange(date_time) %>% 
  mutate(inverter = fct_reorder(inverter, -daily_yield)) %>% 
  group_by(inverter) %>%
  time_decompose(daily_yield, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.05, method = 'gesd') %>%
  time_recompose()

  p2.gd.anomalize %>%
    plot_anomalies(
      ncol = 1,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 1.5,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'daily_yield', title = 'daily_yield')

Abnormal activity occurred from May 18th to the end of May. We hypothesize this was an exogenous factor, as both solar power plants were affected. In the case of power plant 2, 4 inverters were completely down.

DATA DICTIONARY for Sensor Reading data sets

  1. IRRADIATION: Amount of irradiation for the 15 minute interval.
  2. DATE_TIME: Date and time for each observation. Observations recorded at 15 minute intervals.
  3. PLANT_ID: Plant ID - this will be common for the entire file.
  4. SOURCE_KEY: Stands for the sensor panel id. This will be common for the entire file because there’s only one sensor panel for the plant.
  5. MODULE_TEMPERATURE: There’s a module (solar panel) attached to the sensor panel. This is the temperature reading for that module.
  6. AMBIENT_TEMPERATURE: This is the ambient temperature at the plant.

Sensor Data Set: Plant 1

Get and Clean Data

p1.ws = read_csv('Plant_1_Weather_Sensor_Data.csv') %>%
  slice_sample(prop = 1) %>% #!!<NOTE>temp, working with a sample of datset for speed purposes
  clean_names() %>%  #lowercase
  mutate(across(where(is.character),factor)) %>% 
  select(sort(tidyselect::peek_vars())) %>% #sort cols alphabetically
  select(date_time, where(is.factor), where(is.numeric)) %>% #sort cols by data type
  arrange(date_time)

p1.ws$plant_id = NULL
p1.ws$source_key = NULL

sample rows

p1.ws %>% slice_sample(n = 10) %>% DT::datatable()

glimpse structure

p1.ws %>% glimpse()
## Rows: 3,182
## Columns: 4
## $ date_time           <dttm> 2020-05-15 00:00:00, 2020-05-15 00:15:00, 2020...
## $ ambient_temperature <dbl> 25.18432, 25.08459, 24.93575, 24.84613, 24.6215...
## $ irradiation         <dbl> 0.0000000000, 0.0000000000, 0.0000000000, 0.000...
## $ module_temperature  <dbl> 22.85751, 22.76167, 22.59231, 22.36085, 22.1654...

check missing values

p1.ws %>% miss_var_summary
## # A tibble: 4 x 3
##   variable            n_miss pct_miss
##   <chr>                <int>    <dbl>
## 1 date_time                0        0
## 2 ambient_temperature      0        0
## 3 irradiation              0        0
## 4 module_temperature       0        0

EDA: Numeric Vars

viz: numeric univariate distributions

p1.ws %>% select(where(is.numeric)) %>% dlookr::plot_normality()

viz: numeric univariate outlier check

p1.ws %>% select(where(is.numeric)) %>% dlookr::diagnose_outlier()
## # A tibble: 3 x 6
##   variables     outliers_cnt outliers_ratio outliers_mean with_mean without_mean
##   <chr>                <int>          <dbl>         <dbl>     <dbl>        <dbl>
## 1 ambient_temp~            0         0             NaN       25.5         25.5  
## 2 irradiation              2         0.0629          1.19     0.228        0.228
## 3 module_tempe~            0         0             NaN       31.1         31.1
p1.ws %>% select(where(is.numeric)) %>% DataExplorer::plot_density()

correlations: viz

p1.ws %>% dlookr::plot_correlate()

p1.ws %>% select(where(is.numeric)) %>% GGally::ggcorr(low = '#990000', mid = '#E0E0E0', high = '#009900', label = TRUE)

irradiation and module_temperature have a perfect correlation, which is to be expected.

viz: Time Series w/IQR percentiles

ggplotly(p1.ws %>% ggplot(aes(date_time, ambient_temperature)) +
  geom_line(size = 1.2) +
  geom_hline(yintercept = quantile(p1.ws$ambient_temperature, 0.25), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1) +
  geom_hline(yintercept = quantile(p1.ws$ambient_temperature, 0.50), lty = 'dashed', alpha = 0.7, color = 'blue', size = 1.1) +
  geom_hline(yintercept = quantile(p1.ws$ambient_temperature, 0.75), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1)
) %>% layout(title = 'ambient_temperature')
ggplotly(p1.ws %>% ggplot(aes(date_time, irradiation)) +
  geom_line(size = 1.2) +
  geom_hline(yintercept = quantile(p1.ws$irradiation, 0.25), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1) +
  geom_hline(yintercept = quantile(p1.ws$irradiation, 0.50), lty = 'dashed', alpha = 0.7, color = 'blue', size = 1.1) +
  geom_hline(yintercept = quantile(p1.ws$irradiation, 0.75), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1)
) %>% layout(title = 'irradiation')
ggplotly(p1.ws %>% ggplot(aes(date_time, module_temperature)) +
  geom_line(size = 1.2) +
  geom_hline(yintercept = quantile(p1.ws$module_temperature, 0.25), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1) +
  geom_hline(yintercept = quantile(p1.ws$module_temperature, 0.50), lty = 'dashed', alpha = 0.7, color = 'blue', size = 1.1) +
  geom_hline(yintercept = quantile(p1.ws$module_temperature, 0.75), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1)
) %>% layout(title = 'module_temperature')
p1.ws %>% pivot_longer(ambient_temperature:module_temperature) %>% ggplot(aes(date_time, value, color = name)) + geom_line(size = 1.1) + facet_wrap(~name, ncol = 1, scales = 'free_y') + theme(legend.position = 'none')

Ambient temperature (red) peaks right before May 18th. Interestingly, irradiation and module temperature do not peak on this date. This suggests an exogenous factor was at play, We should investigate.

Anomoly Plot

library(anomalize)
# anomalize(data, target, method = c("iqr", "gesd"), alpha = 0.05, max_anoms = 0.2, verbose = FALSE)

# alpha: Controls the width of the "normal" range. Lower values are more conservative while higher values are less prone to incorrectly classifying "normal" observations.
# max_anoms: The maximum percent of anomalies permitted to be identified.

  ggplotly(
  p1.ws %>% arrange(date_time) %>% 
  time_decompose(ambient_temperature, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.15, method = 'gesd') %>%
  time_recompose() %>%
    plot_anomalies(
      ncol = 2,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 2,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'ambient_temperature', title = 'ambient_temperature')
  )
## Converting from tbl_df to tbl_time.
## Auto-index message: index = date_time
## frequency = 96 minutes
## median_span = 1591 minutes
  ggplotly(
  p1.ws %>% arrange(date_time) %>% 
  time_decompose(irradiation, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.05, method = 'gesd') %>%
  time_recompose() %>%
    plot_anomalies(
      ncol = 2,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 2,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'irradiation', title = 'irradiation')
  )
## Converting from tbl_df to tbl_time.
## Auto-index message: index = date_time
## frequency = 96 minutes
## median_span = 1591 minutes
  ggplotly(
  p1.ws %>% arrange(date_time) %>% 
  time_decompose(module_temperature, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.05, method = 'gesd') %>%
  time_recompose() %>%
    plot_anomalies(
      ncol = 2,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 2,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'module_temperature', title = 'module_temperature')
  )
## Converting from tbl_df to tbl_time.
## Auto-index message: index = date_time
## frequency = 96 minutes
## median_span = 1591 minutes

For all features, there are a ton of BOTH lower and upper outliers beginning around May 16th to the end of May, consistent with previous observations. As a whole, things are heating up in Plant 1.

Sensor Data Set: Plant 2

Get and Clean Data

p2.ws = read_csv('Plant_2_Weather_Sensor_Data.csv') %>%
  slice_sample(prop = 1) %>% #!!<NOTE>temp, working with a sample of datset for speed purposes
  clean_names() %>%  #lowercase
  mutate(across(where(is.character),factor)) %>% 
  select(sort(tidyselect::peek_vars())) %>% #sort cols alphabetically
  select(date_time, where(is.factor), where(is.numeric)) %>% #sort cols by data type
  arrange(date_time)

p2.ws$plant_id = NULL
p2.ws$source_key = NULL

sample rows

p2.ws %>% slice_sample(n = 10) %>% DT::datatable()

glimpse structure

p2.ws %>% glimpse()
## Rows: 3,259
## Columns: 4
## $ date_time           <dttm> 2020-05-15 00:00:00, 2020-05-15 00:15:00, 2020...
## $ ambient_temperature <dbl> 27.00476, 26.88081, 26.68206, 26.50059, 26.5961...
## $ irradiation         <dbl> 0.000000000, 0.000000000, 0.000000000, 0.000000...
## $ module_temperature  <dbl> 25.06079, 24.42187, 24.42729, 24.42068, 25.0882...

check missing values

p2.ws %>% miss_var_summary
## # A tibble: 4 x 3
##   variable            n_miss pct_miss
##   <chr>                <int>    <dbl>
## 1 date_time                0        0
## 2 ambient_temperature      0        0
## 3 irradiation              0        0
## 4 module_temperature       0        0

EDA: Numeric Vars

viz: numeric univariate distributions

p2.ws %>% select(where(is.numeric)) %>% dlookr::plot_normality()

viz: numeric univariate outlier check

p2.ws %>% select(where(is.numeric)) %>% dlookr::diagnose_outlier()
## # A tibble: 3 x 6
##   variables     outliers_cnt outliers_ratio outliers_mean with_mean without_mean
##   <chr>                <int>          <dbl>         <dbl>     <dbl>        <dbl>
## 1 ambient_temp~            0         0             NaN       28.1         28.1  
## 2 irradiation              1         0.0307          1.10     0.233        0.232
## 3 module_tempe~            2         0.0614         66.3     32.8         32.8
p2.ws %>% select(where(is.numeric)) %>% DataExplorer::plot_density()

correlations: viz

p2.ws %>% dlookr::plot_correlate()

p2.ws %>% select(where(is.numeric)) %>% GGally::ggcorr(low = '#990000', mid = '#E0E0E0', high = '#009900', label = TRUE)

viz: Time Series w/IQR percentiles

ggplotly(p2.ws %>% ggplot(aes(date_time, ambient_temperature)) +
  geom_line(size = 1.2) +
  geom_hline(yintercept = quantile(p2.ws$ambient_temperature, 0.25), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1) +
  geom_hline(yintercept = quantile(p2.ws$ambient_temperature, 0.50), lty = 'dashed', alpha = 0.7, color = 'blue', size = 1.1) +
  geom_hline(yintercept = quantile(p2.ws$ambient_temperature, 0.75), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1)
) %>% layout(title = 'ambient_temperature')
ggplotly(p2.ws %>% ggplot(aes(date_time, irradiation)) +
  geom_line(size = 1.2) +
  geom_hline(yintercept = quantile(p2.ws$irradiation, 0.25), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1) +
  geom_hline(yintercept = quantile(p2.ws$irradiation, 0.50), lty = 'dashed', alpha = 0.7, color = 'blue', size = 1.1) +
  geom_hline(yintercept = quantile(p2.ws$irradiation, 0.75), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1)
) %>% layout(title = 'irradiation')
ggplotly(p2.ws %>% ggplot(aes(date_time, module_temperature)) +
  geom_line(size = 1.2) +
  geom_hline(yintercept = quantile(p2.ws$module_temperature, 0.25), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1) +
  geom_hline(yintercept = quantile(p2.ws$module_temperature, 0.50), lty = 'dashed', alpha = 0.7, color = 'blue', size = 1.1) +
  geom_hline(yintercept = quantile(p2.ws$module_temperature, 0.75), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1)
) %>% layout(title = 'module_temperature')
p2.ws %>% pivot_longer(ambient_temperature:module_temperature) %>% ggplot(aes(date_time, value, color = name)) + geom_line(size = 1.1) + facet_wrap(~name, ncol = 1, scales = 'free_y') + theme(legend.position = 'none')

Unlike Plant 1, for Plant 2, ambient temperature (red) does NOT peak right before May 18th.

Anomoly Plot

library(anomalize)
# anomalize(data, target, method = c("iqr", "gesd"), alpha = 0.05, max_anoms = 0.2, verbose = FALSE)

# alpha: Controls the width of the "normal" range. Lower values are more conservative while higher values are less prone to incorrectly classifying "normal" observations.
# max_anoms: The maximum percent of anomalies permitted to be identified.
  ggplotly(
  p2.ws %>% arrange(date_time) %>% 
  time_decompose(ambient_temperature, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.15, method = 'gesd') %>%
  time_recompose() %>%
    plot_anomalies(
      ncol = 2,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 2,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'ambient_temperature', title = 'ambient_temperature')
  )
## Converting from tbl_df to tbl_time.
## Auto-index message: index = date_time
## frequency = 96 minutes
## median_span = 1629.5 minutes
  ggplotly(
  p2.ws %>% arrange(date_time) %>% 
  time_decompose(irradiation, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.05, method = 'gesd') %>%
  time_recompose() %>%
    plot_anomalies(
      ncol = 2,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 2,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'irradiation' , title = 'irradiation')
  )
## Converting from tbl_df to tbl_time.
## Auto-index message: index = date_time
## frequency = 96 minutes
## median_span = 1629.5 minutes
  ggplotly(
  p2.ws %>% arrange(date_time) %>% 
  time_decompose(module_temperature, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.05, method = 'gesd') %>%
  time_recompose() %>%
    plot_anomalies(
      ncol = 2,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 2,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'module_temperature', title = 'module_temperature')
  )
## Converting from tbl_df to tbl_time.
## Auto-index message: index = date_time
## frequency = 96 minutes
## median_span = 1629.5 minutes

Ambient Temperature: Plant 2 sensors were not affected the same way as Plant 1 sensors. There were mainly ONLY lower outliers, and mostly only on May 19th. Plant 2 ambient temperature was colder, in contradistinction to Plant 1 ambient temperature, which was hotter.