607Assignment3B Dylan Gold

607 Assignment 3B Dylan Gold

Approach

In this assignment we have to work with a data set that has time series. We have to find/generate a data set then use sql or dplyr to calculate the average from year to year, as well as every 6 days I will find something online, maybe ask a LLM to generate one or give me ideas.

I will likely get the data as a csv onto github then use dplyr to create these averages after importing the data into R

Codebase:

I started off by asking a LLM, (GPT 4) to give me a data set that satisfied the conditions for this assignment. The LLM conversation is in the LLM_Conversation3B.txt file. It asked me to be a bit more specific on what kind of data I wanted. Due to the price of electricity going a lot these days, from what I have heard in large part due to data centers being built. I was curious about this so I told the LLM to give me some data on this.

The LLM prompted me to go to www.eia.gov/electricity/wholesalemarkets/csv/nyiso_load_act_hr_YYYY.csv By changing the YYYY to a specific year we could get different years.

The LLM suggested I download the years I wanted then condense it into daily information. This is what I did. I got data from 2022 to 2025. Because the file size is large, I decided to just download directly without saving the file into github.

library(tidyverse)
library(ggplot2)
library(gt)

Get the 2022 Data

url <- "http://www.eia.gov/electricity/wholesalemarkets/csv/nyiso_load_act_hr_2022.csv"

df_22<- read_csv(
  file = url,
  show_col_types = FALSE, 
  skip = 3 # First three lines are not apart of the data
)

head(df_22, 10)

# A tibble: 10 × 16
   UTC Timestamp (Interval Endin…¹ Local Timestamp East…² Local Timestamp East…³
   <dttm>                          <dttm>                 <dttm>                
 1 2022-01-01 06:00:00             2022-01-01 00:00:00    2022-01-01 01:00:00   
 2 2022-01-01 07:00:00             2022-01-01 01:00:00    2022-01-01 02:00:00   
 3 2022-01-01 08:00:00             2022-01-01 02:00:00    2022-01-01 03:00:00   
 4 2022-01-01 09:00:00             2022-01-01 03:00:00    2022-01-01 04:00:00   
 5 2022-01-01 10:00:00             2022-01-01 04:00:00    2022-01-01 05:00:00   
 6 2022-01-01 11:00:00             2022-01-01 05:00:00    2022-01-01 06:00:00   
 7 2022-01-01 12:00:00             2022-01-01 06:00:00    2022-01-01 07:00:00   
 8 2022-01-01 13:00:00             2022-01-01 07:00:00    2022-01-01 08:00:00   
 9 2022-01-01 14:00:00             2022-01-01 08:00:00    2022-01-01 09:00:00   
10 2022-01-01 15:00:00             2022-01-01 09:00:00    2022-01-01 10:00:00   
# ℹ abbreviated names: ¹`UTC Timestamp (Interval Ending)`,
#   ²`Local Timestamp Eastern Time (Interval Beginning)`,
#   ³`Local Timestamp Eastern Time (Interval Ending)`
# ℹ 13 more variables: `Local Date` <date>, `Hour Number` <dbl>,
#   `F - Capital Actual Load (MW)` <dbl>, `C - Central Actual Load (MW)` <dbl>,
#   `I - Dunwoodie Actual Load (MW)` <dbl>,
#   `B - Genessee Actual Load (MW)` <dbl>, …

Get the 2023 Data

url <- "http://www.eia.gov/electricity/wholesalemarkets/csv/nyiso_load_act_hr_2023.csv"

df_23<- read_csv(
  file = url,
  show_col_types = FALSE, 
  skip = 3 
)

head(df_23, 10)

# A tibble: 10 × 16
   UTC Timestamp (Interval Endin…¹ Local Timestamp East…² Local Timestamp East…³
   <dttm>                          <dttm>                 <dttm>                
 1 2023-01-01 06:00:00             2023-01-01 00:00:00    2023-01-01 01:00:00   
 2 2023-01-01 07:00:00             2023-01-01 01:00:00    2023-01-01 02:00:00   
 3 2023-01-01 08:00:00             2023-01-01 02:00:00    2023-01-01 03:00:00   
 4 2023-01-01 09:00:00             2023-01-01 03:00:00    2023-01-01 04:00:00   
 5 2023-01-01 10:00:00             2023-01-01 04:00:00    2023-01-01 05:00:00   
 6 2023-01-01 11:00:00             2023-01-01 05:00:00    2023-01-01 06:00:00   
 7 2023-01-01 12:00:00             2023-01-01 06:00:00    2023-01-01 07:00:00   
 8 2023-01-01 13:00:00             2023-01-01 07:00:00    2023-01-01 08:00:00   
 9 2023-01-01 14:00:00             2023-01-01 08:00:00    2023-01-01 09:00:00   
10 2023-01-01 15:00:00             2023-01-01 09:00:00    2023-01-01 10:00:00   
# ℹ abbreviated names: ¹`UTC Timestamp (Interval Ending)`,
#   ²`Local Timestamp Eastern Time (Interval Beginning)`,
#   ³`Local Timestamp Eastern Time (Interval Ending)`
# ℹ 13 more variables: `Local Date` <date>, `Hour Number` <dbl>,
#   `F - Capital Actual Load (MW)` <dbl>, `C - Central Actual Load (MW)` <dbl>,
#   `I - Dunwoodie Actual Load (MW)` <dbl>,
#   `B - Genessee Actual Load (MW)` <dbl>, …

Get the 2024 Data

url <- "http://www.eia.gov/electricity/wholesalemarkets/csv/nyiso_load_act_hr_2024.csv"

df_24<- read_csv(
  file = url,
  show_col_types = FALSE, 
  skip = 3 
)

head(df_24, 10)

# A tibble: 10 × 16
   UTC Timestamp (Interval Endin…¹ Local Timestamp East…² Local Timestamp East…³
   <dttm>                          <dttm>                 <dttm>                
 1 2024-01-01 06:00:00             2024-01-01 00:00:00    2024-01-01 01:00:00   
 2 2024-01-01 07:00:00             2024-01-01 01:00:00    2024-01-01 02:00:00   
 3 2024-01-01 08:00:00             2024-01-01 02:00:00    2024-01-01 03:00:00   
 4 2024-01-01 09:00:00             2024-01-01 03:00:00    2024-01-01 04:00:00   
 5 2024-01-01 10:00:00             2024-01-01 04:00:00    2024-01-01 05:00:00   
 6 2024-01-01 11:00:00             2024-01-01 05:00:00    2024-01-01 06:00:00   
 7 2024-01-01 12:00:00             2024-01-01 06:00:00    2024-01-01 07:00:00   
 8 2024-01-01 13:00:00             2024-01-01 07:00:00    2024-01-01 08:00:00   
 9 2024-01-01 14:00:00             2024-01-01 08:00:00    2024-01-01 09:00:00   
10 2024-01-01 15:00:00             2024-01-01 09:00:00    2024-01-01 10:00:00   
# ℹ abbreviated names: ¹`UTC Timestamp (Interval Ending)`,
#   ²`Local Timestamp Eastern Time (Interval Beginning)`,
#   ³`Local Timestamp Eastern Time (Interval Ending)`
# ℹ 13 more variables: `Local Date` <date>, `Hour Number` <dbl>,
#   `F - Capital Actual Load (MW)` <dbl>, `C - Central Actual Load (MW)` <dbl>,
#   `I - Dunwoodie Actual Load (MW)` <dbl>,
#   `B - Genessee Actual Load (MW)` <dbl>, …

Get the 2025 Data, I notice it only goes 6 months in.

url <- "http://www.eia.gov/electricity/wholesalemarkets/csv/nyiso_load_act_hr_2025.csv"

df_25<- read_csv(
  file = url,
  show_col_types = FALSE, 
  skip = 3 
)

head(df_25, 10)

# A tibble: 10 × 16
   UTC Timestamp (Interval Endin…¹ Local Timestamp East…² Local Timestamp East…³
   <dttm>                          <dttm>                 <dttm>                
 1 2025-01-01 06:00:00             2025-01-01 00:00:00    2025-01-01 01:00:00   
 2 2025-01-01 07:00:00             2025-01-01 01:00:00    2025-01-01 02:00:00   
 3 2025-01-01 08:00:00             2025-01-01 02:00:00    2025-01-01 03:00:00   
 4 2025-01-01 09:00:00             2025-01-01 03:00:00    2025-01-01 04:00:00   
 5 2025-01-01 10:00:00             2025-01-01 04:00:00    2025-01-01 05:00:00   
 6 2025-01-01 11:00:00             2025-01-01 05:00:00    2025-01-01 06:00:00   
 7 2025-01-01 12:00:00             2025-01-01 06:00:00    2025-01-01 07:00:00   
 8 2025-01-01 13:00:00             2025-01-01 07:00:00    2025-01-01 08:00:00   
 9 2025-01-01 14:00:00             2025-01-01 08:00:00    2025-01-01 09:00:00   
10 2025-01-01 15:00:00             2025-01-01 09:00:00    2025-01-01 10:00:00   
# ℹ abbreviated names: ¹`UTC Timestamp (Interval Ending)`,
#   ²`Local Timestamp Eastern Time (Interval Beginning)`,
#   ³`Local Timestamp Eastern Time (Interval Ending)`
# ℹ 13 more variables: `Local Date` <date>, `Hour Number` <dbl>,
#   `F - Capital Actual Load (MW)` <dbl>, `C - Central Actual Load (MW)` <dbl>,
#   `I - Dunwoodie Actual Load (MW)` <dbl>,
#   `B - Genessee Actual Load (MW)` <dbl>, …

Now we can combine them into a single dataframe with Rbind, they have the same columns so it should be fine. I see in the R global environment that df now has 34656 values, which is the total of all 4 data frames (4* 8664)

df_22to25 <- df_22 %>% 
  rbind(df_23, .) %>% 
  rbind(df_24, .) %>%  # . is like the pipe input
  rbind(df_25, .) #We can pipe the rbind to bind them all at once.
  
#
head(df_22to25)

# A tibble: 6 × 16
  UTC Timestamp (Interval Ending…¹ Local Timestamp East…² Local Timestamp East…³
  <dttm>                           <dttm>                 <dttm>                
1 2025-01-01 06:00:00              2025-01-01 00:00:00    2025-01-01 01:00:00   
2 2025-01-01 07:00:00              2025-01-01 01:00:00    2025-01-01 02:00:00   
3 2025-01-01 08:00:00              2025-01-01 02:00:00    2025-01-01 03:00:00   
4 2025-01-01 09:00:00              2025-01-01 03:00:00    2025-01-01 04:00:00   
5 2025-01-01 10:00:00              2025-01-01 04:00:00    2025-01-01 05:00:00   
6 2025-01-01 11:00:00              2025-01-01 05:00:00    2025-01-01 06:00:00   
# ℹ abbreviated names: ¹`UTC Timestamp (Interval Ending)`,
#   ²`Local Timestamp Eastern Time (Interval Beginning)`,
#   ³`Local Timestamp Eastern Time (Interval Ending)`
# ℹ 13 more variables: `Local Date` <date>, `Hour Number` <dbl>,
#   `F - Capital Actual Load (MW)` <dbl>, `C - Central Actual Load (MW)` <dbl>,
#   `I - Dunwoodie Actual Load (MW)` <dbl>,
#   `B - Genessee Actual Load (MW)` <dbl>, …

Now we can clean the data. There are a lot of columns giving us information on different places. I will just get some that I find interesting. We also are going to combine the hourly data to daily data. We can do this by grouping by the local date and summing up values in this So I will select the local date, as well as New York City load and Capital Actual Load

df <- select(df_22to25, "Local Date", "J - New York City Actual Load (MW)", "F - Capital Actual Load (MW)" ) %>%
  rename(Date = "Local Date", NYC_Load_MW = "J - New York City Actual Load (MW)",Capital_Load_MW= "F - Capital Actual Load (MW)")
head(df)

# A tibble: 6 × 3
  Date       NYC_Load_MW Capital_Load_MW
  <date>           <dbl>           <dbl>
1 2025-01-01       4680.           1163.
2 2025-01-01       4518.           1154.
3 2025-01-01       4400.           1122.
4 2025-01-01       4292.           1104.
5 2025-01-01       4221.           1102.
6 2025-01-01       4276.           1117.

  # New columns will be NYC_Load(MW) and Capital_Load(MW) which are the sum of the old ones. We can do this with summarise
df_daily <- df %>% group_by(Date) %>% 
  summarize(
    NYC_Load_MW = sum(NYC_Load_MW),
    Capital_Load_MW = sum(Capital_Load_MW)
  )
tail(df_daily)

# A tibble: 6 × 3
  Date       NYC_Load_MW Capital_Load_MW
  <date>           <dbl>           <dbl>
1 2025-06-18     145166.          32286.
2 2025-06-19     164145.          37476.
3 2025-06-20     155307.          31819.
4 2025-06-21     154500.          31061.
5 2025-06-22     166034.          36609.
6 2025-06-23     151965.          29771.

df_daily

# A tibble: 1,204 × 3
   Date       NYC_Load_MW Capital_Load_MW
   <date>           <dbl>           <dbl>
 1 2022-01-01     111514.          29764.
 2 2022-01-02     112472.          31662.
 3 2022-01-03     134730.          36325.
 4 2022-01-04     139265.          36847.
 5 2022-01-05     134372.          35364.
 6 2022-01-06     132235.          34393.
 7 2022-01-07     137456.          35448.
 8 2022-01-08     133163.          35288.
 9 2022-01-09     117183.          32005.
10 2022-01-10     138840.          35872.
# ℹ 1,194 more rows

We now have a data frame containing the daily NYC Load and Captial_Load in megawatts from 2022 to 2025.

Lets now get the year to date average. From what I looked up that is just the daily average over the course of a calendar/fiscal year. Our date starts on on 1-1 so we just need to group by the same year then get the mean

df_YTD <- df_daily %>%
  group_by(Year = format(Date, "%Y")) %>%
  summarize(
    NYC_Load_YearAvg = mean(NYC_Load_MW),
    Capital_Load_YearAvg= mean(Capital_Load_MW)
    ) 
  #Also rename the groupby column
df_YTD

# A tibble: 4 × 3
  Year  NYC_Load_YearAvg Capital_Load_YearAvg
  <chr>            <dbl>                <dbl>
1 2022           134740.               32345.
2 2023           127525.               29899.
3 2024           134573.               30768.
4 2025           129013.               30778.

We can also plot this

ggplot(df_YTD, aes(x = Year,y = NYC_Load_YearAvg)) + 
  geom_point() +
  labs(y = "Load (MegaWatts)", title = "Average Daily NYC Load Per Year 2022-2025")

ggplot(df_YTD, aes(x = Year,y = Capital_Load_YearAvg)) + 
  geom_point() +
  labs(y = "Load (MegaWatts)", title = "Average Daily Capital Load Per Year 2022-2025")

We can now move on to creating a 6 day moving average. We need to get the average for every 6 days

# To do this, from what I have seen we can create a new column to group.
#We generate this column with the integer divison operator %/%. By doing this we can see which group the current row is on based on its row index

segment = 6 #Value for how often we want to segment the data

df_6dayAvg <- df_daily %>%
  mutate(group = (row_number() %/% segment)) %>% #Create the grouping row
  group_by(group) %>%
  summarize(
    Date = min(Date), #Get the first date as the date for the 6 day grouping
    NYC_Load_SixAvg = mean(NYC_Load_MW),
    Capital_Load_SixAvg= mean(Capital_Load_MW)
  ) %>% 
  select(-group) # Remove the grouping column we made for this
  
head(df_6dayAvg)

# A tibble: 6 × 3
  Date       NYC_Load_SixAvg Capital_Load_SixAvg
  <date>               <dbl>               <dbl>
1 2022-01-01         126471.              33992.
2 2022-01-06         134562.              35303.
3 2022-01-12         135406.              36533.
4 2022-01-18         135273.              35831.
5 2022-01-24         139994.              37039.
6 2022-01-30         139791.              37213.

We now have a data frame with an average of every 6 days for the load for Captial and NYC.

I will create graphs showing this, Our 6 day average NYC graph

ggplot(df_6dayAvg, aes(x = Date,y = NYC_Load_SixAvg)) + 
  geom_point() +
  geom_line() +
  labs(y = "Load (MegaWatts)", title = "6 Day Moving Average NYC Load 2022-2025")

Our 6 day average Capital graph

ggplot(df_6dayAvg, aes(x = Date,y = Capital_Load_SixAvg)) + 
  geom_point() +
  geom_line() +
  labs(y = "Load (MegaWatts)", title = "Average Daily Capital Load Per Year 2022-2025")

We can see both these graphs have a similar trend, spiking up towards the middle of the year each year to some extent. Something to note is how we don’t have a full year of 2025 data.

Conclusion

In this assignment we had to work with data frames with date attributes to create different data frames based on time intervals. I found the problem at least when using dplyr is figuring out how to group the data. Using the integer division operator is a nice way of segmenting the data into different groups but depending on your criteria you could group the data very differently. Something I would maybe add on to this assignment is exploring the other columns that I skipped at the start and maybe seeing the relationship between them, ultimately how the capital and central load are influenced on other columns and by how much.