Part A

Part B

Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx - Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours; the rest is straightforward. Add this to your existing files above.

# Load the packages used throughout: readxl for Excel import, and the fpp3 meta-package,
# which attaches tsibble, dplyr, lubridate, ggplot2, feasts, and fable.
library(readxl)
library(fpp3)

# Step 1: Download and Load Excel Data from GitHub
# Set up the file URL from GitHub and save the workbook to a temporary path for further use.
data_url <- "https://github.com/Naik-Khyati/data_624/raw/main/p1/ResidentialCustomerForecastLoad-624.xlsx"
temp_file <- tempfile(fileext = ".xlsx")
download.file(data_url, temp_file, mode = "wb")

# Step 2: Load the Excel Data into a DataFrame
# Import the downloaded Excel file into an R data frame for analysis.
residential_power_data <- read_excel(temp_file)

# Step 3: Rename the Date Column and Prepare Time Series Data
# Renaming the date column for consistency and transforming the date format to a Year-Month index.
residential_power_data <- rename(residential_power_data, observation_date = `YYYY-MMM`)
power_data_ts <- residential_power_data %>%
  mutate(MonthIndex = yearmonth(observation_date)) %>%  # Convert to year-month format
  select(-CaseSequence, -observation_date) %>%  # Remove unnecessary columns
  as_tsibble(index = MonthIndex)  # Convert to a tsibble for time series analysis

# Step 4: Identifying Missing Values
# Checking for rows with missing KWH (kilowatt-hour) values.
power_data_ts[which(is.na(power_data_ts$KWH)), ]
## # A tsibble: 1 x 2 [1M]
##     KWH MonthIndex
##   <dbl>      <mth>
## 1    NA   2008 Sep
# Step 5: Calculate the Mean for Missing Value Imputation
# Calculate the mean of KWH excluding missing values to replace NA values.
average_KWH <- mean(power_data_ts$KWH, na.rm = TRUE)

# Step 6: Impute Missing KWH Values with Mean
# Replace missing KWH values with the calculated mean for consistent analysis.
power_data_ts$KWH[which(is.na(power_data_ts$KWH))] <- average_KWH

# Step 7: Summarize the KWH Data
# View summary statistics of the KWH column after imputation.
summary(power_data_ts$KWH)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   770523  5434539  6314472  6502475  7608792 10655730
# Step 8: Plotting the Time Series of KWH
# Visualizing the KWH data as a time series to understand overall patterns in usage.
power_data_ts %>%
  autoplot(KWH) +
  labs(title = "Residential Power Consumption (Kilowatt Hours)")
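
# (Optional, a sketch not in the original steps: an STL decomposition makes the trend
# and the seasonal pattern discussed later explicit.)
power_data_ts %>%
  model(STL(KWH ~ season(window = "periodic"))) %>%
  components() %>%
  autoplot()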

# Step 9: Time Series Display Before Transformation
# Display the series along with its ACF and PACF before any transformations.
power_data_ts %>%
  gg_tsdisplay(KWH, plot_type = 'partial') +
  labs(title = "Pre-Transformation: Residential Power Consumption")

# Step 10: Estimate Lambda for Box-Cox Transformation
# Estimating the best lambda value for the Box-Cox transformation using Guerrero's method.
lambda_value <- power_data_ts %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)

# Step 11: Conduct Unit Root Test Using KPSS
# Use repeated KPSS tests (unitroot_ndiffs) to estimate how many first differences are needed for stationarity.
power_data_ts %>%
  features(box_cox(KWH, lambda_value), unitroot_ndiffs)
## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      1
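# (Optional, a sketch not in the original steps: since monthly power usage is seasonal,
# the suggested number of seasonal differences can be checked the same way.)
power_data_ts %>%
  features(box_cox(KWH, lambda_value), unitroot_nsdiffs)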
# Step 12: Apply Box-Cox Transformation, Difference, and Display Results
# Apply the Box-Cox transformation, take the first difference suggested by the unit root test, and display the result.
power_data_ts %>%
  gg_tsdisplay(difference(box_cox(KWH, lambda_value)), plot_type = 'partial') +
  labs(title = paste0("Post-Transformation Power Consumption (λ = ", round(lambda_value, 2), ")"))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

# Step 13: Compare Different ARIMA Models
# Fit several ARIMA models with varying orders to identify the best-fitting model.
arima_models <- power_data_ts %>%
  model(
    ARIMA_110 = ARIMA(box_cox(KWH, lambda_value) ~ pdq(1, 1, 0)),
    ARIMA_120 = ARIMA(box_cox(KWH, lambda_value) ~ pdq(1, 2, 0)),
    ARIMA_210 = ARIMA(box_cox(KWH, lambda_value) ~ pdq(2, 1, 0)),
    ARIMA_212 = ARIMA(box_cox(KWH, lambda_value) ~ pdq(2, 1, 2)),
    ARIMA_111 = ARIMA(box_cox(KWH, lambda_value) ~ pdq(1, 1, 1))
  )

# Step 14: Review Model Comparison
# Compare the ARIMA models using metrics like AICc and BIC to choose the best fit.
glance(arima_models) %>% 
  arrange(AICc) %>% 
  select(.model, AICc, BIC)
## # A tibble: 5 × 3
##   .model     AICc   BIC
##   <chr>     <dbl> <dbl>
## 1 ARIMA_111  617.  633.
## 2 ARIMA_212  624.  640.
## 3 ARIMA_110  663.  676.
## 4 ARIMA_210  676.  692.
## 5 ARIMA_120  793.  806.
# Step 15: Evaluate Residuals of the Best Model (ARIMA(1,1,1))
# Inspect the residuals of the best-performing ARIMA model to assess fit quality.
arima_models %>%
  select(ARIMA_111) %>%
  gg_tsresiduals() +
  ggtitle("Residuals for ARIMA(1,1,1)")

# Step 16: Forecasting Next 12 Months (2014) with Best ARIMA Model
# Forecast the next 12 months using the ARIMA(1,1,1) model and visualize the forecast.
forecast_data <- power_data_ts %>%
  model(ARIMA_111 = ARIMA(box_cox(KWH, lambda_value) ~ pdq(1, 1, 1))) %>%
  forecast(h = 12)

# Step 17: Plot Forecast Results for 2014
# Generate a time series plot displaying the forecasted power consumption for 2014.
forecast_data %>%
  autoplot() +
  labs(title = "Residential Power Usage Forecast for 2014 (Monthly)")

Two key data preparation steps were undertaken before the time series analysis. First, the date column, which was read in as a string, was converted into a year-month index using the yearmonth() function. The dataset was then converted into a tsibble (time series tibble), a structure specifically designed for handling time-indexed data. These steps allow us to use the time series modelling and plotting functions applied in the steps above.
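
A couple of quick sanity checks on the resulting monthly index can be run with tsibble helpers (a sketch, not part of the rendered output above):

has_gaps(power_data_ts)    # any TRUE here would indicate missing months in the index
scan_gaps(power_data_ts)   # lists the specific missing months, if any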

The second preparation step was handling missing values. Only a single missing data point was identified, a month with no recorded KWH value. Given the seasonal patterns and overall consistency observed in the data, mean imputation was chosen to fill this gap: it preserves the dataset’s overall structure and has minimal impact on the seasonal behavior of the series.
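
As an optional cross-check on that choice (a sketch, not part of the original workflow), the overall mean can be compared with the mean of the other September observations; a large gap between the two would argue for a same-month (seasonal) imputation instead:

average_KWH  # overall mean used for the imputation above

residential_power_data %>%
  mutate(MonthIndex = yearmonth(observation_date)) %>%
  filter(month(as.Date(MonthIndex)) == 9) %>%            # September observations only
  summarise(september_mean_KWH = mean(KWH, na.rm = TRUE))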

To select the best forecasting model, I compared several candidate ARIMA specifications. Among these, ARIMA(1,1,1) consistently outperformed the others on the key metrics: the corrected Akaike Information Criterion (AICc) and the Bayesian Information Criterion (BIC). The lower values indicate that the ARIMA(1,1,1) model is the most suitable fit for these data. This result aligns with expectations: with its relatively small number of parameters, the simpler model reduces the risk of overfitting and limits unnecessary noise in the forecast. The ARIMA(1,1,1) model therefore strikes a good balance between accuracy and simplicity, making it the preferred choice for this analysis.
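
As a cross-check (a sketch, not part of the comparison above), fable's ARIMA() can also be allowed to search the non-seasonal and seasonal orders automatically, and its report compared against the manual candidates. Note that information criteria are only directly comparable between models with the same orders of differencing.

power_data_ts %>%
  model(ARIMA_auto = ARIMA(box_cox(KWH, lambda_value))) %>%
  report()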

Some residual outliers are evident in the model diagnostics, most noticeably in the ACF (autocorrelation function) plot, where spikes extending beyond the significance bounds indicate significant autocorrelation at those lags. The residual plot also shows one prominent outlier below -10 that stands apart from the rest. Despite these outliers, the residuals are generally centered around zero and roughly symmetric, suggesting that most forecast errors are small and unbiased. This overall pattern supports the reliability of the model, even with some outliers present.
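
A Ljung-Box test on the innovation residuals gives a formal check of the autocorrelation visible in the ACF (a sketch; the degrees of freedom follow the usual p + q rule for ARIMA(1,1,1)). A small p-value would indicate that significant autocorrelation remains.

arima_models %>%
  select(ARIMA_111) %>%
  augment() %>%
  features(.innov, ljung_box, lag = 24, dof = 2)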

Part C

BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx - Part C consists of two data sets. These are simple two-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (for multiple recordings within an hour, take the mean). Then determine whether the data are stationary and whether they can be forecast. If so, provide a week-forward forecast and present the results via RPubs and a .rmd file, along with the forecast in an Excel-readable file.

pipe1_url <- "https://github.com/Naik-Khyati/data_624/raw/main/p1/Waterflow_Pipe1.xlsx"
pipe2_url <- "https://github.com/Naik-Khyati/data_624/raw/main/p1/Waterflow_Pipe2.xlsx"

# Create temporary file paths for downloading the Excel files
temp_pipe1 <- tempfile(fileext = ".xlsx")
temp_pipe2 <- tempfile(fileext = ".xlsx")

# Download the water flow data files from GitHub to the temporary file paths
download.file(pipe1_url, temp_pipe1, mode = "wb")
download.file(pipe2_url, temp_pipe2, mode = "wb")

# Read the Excel data into data frames with appropriate column types
pipe1_data_raw <- read_excel(temp_pipe1, col_types = c("date", "numeric"))
pipe2_data_raw <- read_excel(temp_pipe2, col_types = c("date", "numeric"))

# Ensure the 'Date Time' column is POSIXct in a consistent time zone
# (read_excel already returns date-times; this mainly standardizes the time zone to GMT)
pipe1_data_raw$`Date Time` <- as.POSIXct(pipe1_data_raw$`Date Time`, 
                                         origin = "1899-12-30", tz = "GMT")
pipe2_data_raw$`Date Time` <- as.POSIXct(pipe2_data_raw$`Date Time`, 
                                         origin = "1899-12-30", tz = "GMT")

# Preview the first few rows of the data to check the structure and values
head(pipe1_data_raw)
## # A tibble: 6 × 2
##   `Date Time`         WaterFlow
##   <dttm>                  <dbl>
## 1 2015-10-23 00:24:06     23.4 
## 2 2015-10-23 00:40:02     28.0 
## 3 2015-10-23 00:53:51     23.1 
## 4 2015-10-23 00:55:40     30.0 
## 5 2015-10-23 01:19:17      6.00
## 6 2015-10-23 01:23:58     15.9
head(pipe2_data_raw)
## # A tibble: 6 × 2
##   `Date Time`         WaterFlow
##   <dttm>                  <dbl>
## 1 2015-10-23 01:00:00      18.8
## 2 2015-10-23 02:00:00      43.1
## 3 2015-10-23 03:00:00      38.0
## 4 2015-10-23 04:00:00      36.1
## 5 2015-10-23 05:00:00      31.9
## 6 2015-10-23 06:00:00      28.2
# Extract Date and Time from the 'Date Time' variable
pipe1_data_raw$Date <- as.Date(pipe1_data_raw$`Date Time`)  # Extract date part
pipe1_data_raw$Hour <- hour(pipe1_data_raw$`Date Time`) + 1  # Shift by 1 so each reading is labelled by its hour-ending time (e.g., 00:24 falls in the 01:00 bin)

# Group the data by Date and Hour, calculating the average water flow for each hour
pipe1_data_aggregated <- pipe1_data_raw %>%
  group_by(Date, Hour) %>%
  summarise(Average_Water_Flow = mean(WaterFlow, na.rm = TRUE)) %>% 
  ungroup()  # Remove grouping structure
## `summarise()` has grouped output by 'Date'. You can override using the
## `.groups` argument.
# Create a new 'Date Time' column combining Date and Hour for time series analysis
pipe1_data_aggregated$`Date Time` <- with(pipe1_data_aggregated, ymd_h(paste(Date, Hour)))
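
# (Optional sanity check, a sketch not in the original: confirm every Date/Hour
# combination parsed to a valid timestamp; a non-zero count would signal parsing issues.)
sum(is.na(pipe1_data_aggregated$`Date Time`))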

# Select relevant columns and rename for clarity
pipe1_data_cleaned <- pipe1_data_aggregated %>% 
  select(c(`Date Time`, Average_Water_Flow)) %>%
  rename(WaterFlow = Average_Water_Flow)

# Preview the cleaned data
head(pipe1_data_cleaned)
## # A tibble: 6 × 2
##   `Date Time`         WaterFlow
##   <dttm>                  <dbl>
## 1 2015-10-23 01:00:00      26.1
## 2 2015-10-23 02:00:00      18.9
## 3 2015-10-23 03:00:00      15.2
## 4 2015-10-23 04:00:00      23.1
## 5 2015-10-23 05:00:00      15.5
## 6 2015-10-23 06:00:00      22.7
pipe1_data_ts <- pipe1_data_cleaned %>%
  as_tsibble(index = `Date Time`)

pipe1_data_ts %>% autoplot()
## Plot variable not specified, automatically selected `.vars = WaterFlow`
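
The remaining Part C steps (the stationarity check, the one-week forecast, and the Excel export) are sketched below for Pipe 1, assuming the same fable/feasts workflow used in Part B and the writexl package; Pipe 2 can be treated the same way once aggregated to hourly means, and the output file name is illustrative.

# Fill any missing hours so the tsibble has a regular hourly index, using the
# series mean for the filled values (the same simple approach as in Part B)
pipe1_data_ts <- pipe1_data_ts %>%
  fill_gaps(WaterFlow = mean(WaterFlow))

# Stationarity check: KPSS statistic/p-value and the suggested number of first differences
pipe1_data_ts %>%
  features(WaterFlow, list(unitroot_kpss, unitroot_ndiffs))

# Fit an automatically selected ARIMA model and forecast one week ahead (168 hours)
pipe1_forecast <- pipe1_data_ts %>%
  model(ARIMA(WaterFlow)) %>%
  forecast(h = 24 * 7)

pipe1_forecast %>% autoplot(pipe1_data_ts)

# Write the point forecasts to an Excel-readable file
pipe1_forecast %>%
  as_tibble() %>%
  select(`Date Time`, Forecast_WaterFlow = .mean) %>%
  writexl::write_xlsx("Waterflow_Pipe1_forecast.xlsx")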