As consumers become more conscious of their energy consumption, especially from appliances, the demand for effective ways to reduce energy use has grown. As appliances become more advanced and diverse, factors such as usage patterns and optimization matter more when studying energy demand and consumption in residential households. In a generation that is highly conscious of sustainability, it is of the utmost importance that residential households have the technology and knowledge to analyze energy demand and consumption and to devise strategies for using appliances as efficiently as possible. Such investigations can be conducted with R, a programming language and software environment for statistical computing, data analysis, and data visualization. In this analysis, clustering, association rules, and dimensionality reduction will be used to investigate energy demand and consumption among appliances within residential households.

The data set used for this analysis is titled “Dataset of an energy community’s generation and consumption with appliance allocation”. Based on real data from the Greater London Area, it contains sample consumption and photovoltaic generation profiles for 50 residential households as well as a municipal library. Only the observations from the 50 residential households are taken into account in this analysis. The observations cover 10 appliances that are commonly used within households: air conditioning, kettles, televisions, microwaves, lights, refrigerators, dishwashers, washing machines, dryers, and water heaters. However, no data was collected for air conditioning within the residential households, so air conditioning is not included in this analysis. The dependent variable is energy consumption, measured in kilowatt hours (kWh). The independent variable is time, recorded in 15 minute intervals. Since the purpose of this analysis is to understand how the usage of these appliances varies by hour of the day, temporal aggregation was used to combine the observations into one hour intervals: the four 15 minute observations within each hour were added up and divided by 4 to obtain the average kWh value for that hour.
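The temporal aggregation described above can be illustrated on a small made-up series. The following is only a sketch: the values and the names `quarter_hour_kwh`, `hour_index`, `hourly_mean`, and `hourly_total` are hypothetical and do not come from the dataset. It also shows the hourly sum next to the hourly mean, since the mean of four 15 minute readings is one quarter of the energy used over the whole hour.

```r
# Toy series: eight 15 minute readings in kWh (two hours of data); values are made up.
quarter_hour_kwh <- c(0.10, 0.20, 0.30, 0.40, 0.50, 0.50, 0.50, 0.50)

# Integer division maps readings 1-4 to group 0 and readings 5-8 to group 1.
hour_index <- (seq_along(quarter_hour_kwh) - 1) %/% 4

# Mean of the four readings in each hour (the approach described in the text).
hourly_mean <- tapply(quarter_hour_kwh, hour_index, mean)

# Sum of the four readings, i.e. the energy used over the whole hour.
hourly_total <- tapply(quarter_hour_kwh, hour_index, sum)

print(hourly_mean)
print(hourly_total)
```

Whether the mean or the sum is the more useful hourly figure depends on the question being asked; the analysis in this document uses the mean.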

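# Install any missing packages once, before loading them.
required_packages <- c("readxl", "dplyr", "arules", "arulesViz",
                       "factoextra", "ggplot2", "ggrepel")
missing_packages <- setdiff(required_packages, rownames(installed.packages()))
if (length(missing_packages) > 0) install.packages(missing_packages)

library(readxl)
library(dplyr)
library(arules)
library(arulesViz)
library(factoextra)
library(ggplot2)
library(ggrepel)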


Energy_Dataset <- "C:/Users/slapm/OneDrive/Desktop/Dataset.xlsx"

These lines of code read the sheets of the 50 consumers, combine them, and average the observations across consumers for each 15 minute interval.

Combined_Energy_Dataset <- list()

Further Documentation: https://www.dataquest.io/blog/for-loop-in-r/

for (i in 1:50) {

  Consumer <- paste0("Consumer", i)


  Combined_Energy_Dataset[[i]] <- read_excel(Energy_Dataset, sheet = Consumer)
}

Further Documentation: https://dplyr.tidyverse.org/reference/bind_rows.html

Energy_Dataset <- bind_rows(Combined_Energy_Dataset)

Further Documentation: https://dplyr.tidyverse.org/reference/group_by.html

Further Documentation: https://dplyr.tidyverse.org/reference/across.html

Further Documentation: https://dplyr.tidyverse.org/reference/summarise.html

Averages_by_period <- Energy_Dataset %>%
  group_by(...1) %>%   
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)), .groups = "drop")  


print(Averages_by_period)
Averages_by_period <- Averages_by_period  %>%
  rename(Intervals = ...1)

Further Documentation: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/tapply

Further Documentation: https://r-lang.com/seq_along-function-in-r/

Total Consumption:

observationsTotal_Consumption <- (Averages_by_period$"Total Consumption")

averagesobservationsTotal_Consumption <- tapply(observationsTotal_Consumption, (seq_along(observationsTotal_Consumption) - 1) %/% 4, mean)

print(averagesobservationsTotal_Consumption)

Washing Machine:

observations1Washing_Machine <- (Averages_by_period$"Washing Machine")

averagesobservations1Washing_Machine<- tapply(observations1Washing_Machine, (seq_along(observations1Washing_Machine) - 1) %/% 4, mean)

print(averagesobservations1Washing_Machine)

Dishwasher:

observations1DishWasher <- (Averages_by_period$"Dish washer")

averagesobservations1DishWasher <- tapply(observations1DishWasher, (seq_along(observations1DishWasher) - 1) %/% 4, mean)

print(averagesobservations1DishWasher)

Dryer:

observations1Dryer <- (Averages_by_period$"Dryer")

averagesobservations1Dryer <- tapply(observations1Dryer, (seq_along(observations1Dryer) - 1) %/% 4, mean)

print(averagesobservations1Dryer)

TV:

observations1TV <- (Averages_by_period$"TV")

averageobservations1TV <- tapply(observations1TV , (seq_along(observations1TV) - 1) %/% 4, mean)

print(averageobservations1TV)

Microwave:

observations1Microwave <- (Averages_by_period$"Microwave")

averageobservations1Microwave <- tapply(observations1Microwave, (seq_along(observations1Microwave) - 1) %/% 4, mean)

print(averageobservations1Microwave)

Kettle:

observations1Kettle <- (Averages_by_period$"Kettle")

averageobservations1Kettle <- tapply(observations1Kettle, (seq_along(observations1Kettle) - 1) %/% 4, mean)

print(averageobservations1Kettle)

Lighting:

observations1Lighting <- (Averages_by_period$"Lighting")

averageobservations1Lighting<- tapply(observations1Lighting, (seq_along(observations1Lighting) - 1) %/% 4, mean)

print(averageobservations1Lighting)

Refrigerator:

observations1Refrigerator <- (Averages_by_period$"Refrigerator")

averageobservations1Refrigerator<- tapply(observations1Refrigerator, (seq_along(observations1Refrigerator) - 1) %/% 4, mean)

print(averageobservations1Refrigerator)

Water Heater:

observations1WaterHeater <- (Averages_by_period$"Water heater")

averageobservations1WaterHeater <- tapply(observations1WaterHeater, (seq_along(observations1WaterHeater) - 1) %/% 4, mean)

print(averageobservations1WaterHeater)
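The per-appliance blocks above all repeat the same `tapply()` call. As a sketch only, the same aggregation could be written once and applied to every column. The helper name `block_mean` and the toy data frame `toy_period` are hypothetical; `toy_period` merely stands in for `Averages_by_period` with made-up values.

```r
# Hypothetical helper: average every block of `block_size` consecutive values.
block_mean <- function(x, block_size = 4) {
  groups <- (seq_along(x) - 1) %/% block_size
  as.numeric(tapply(x, groups, mean))
}

# Toy stand-in for Averages_by_period with two appliance columns (made-up values).
toy_period <- data.frame(
  Kettle = c(0, 0, 0.4, 0.4, 0.2, 0.2, 0.2, 0.2),
  TV     = c(0.1, 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1)
)

# Apply the helper to every column at once; each column shrinks from 8 rows to 2.
toy_hourly <- as.data.frame(lapply(toy_period, block_mean))
print(toy_hourly)
```

This keeps the aggregation logic in one place, so a change to the interval length only has to be made once.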

The dates and hours of the observations will also be included, based on the intervals they correspond to.

Further Documentation: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/rep

Further Documentation: https://www.statology.org/r-add-days-to-date/

Dates <- seq(as.Date("2019-01-01"), as.Date("2020-01-01"), by = "day")

Hours <- 1:24

complete_dates <- rep(Dates, each = length(Hours))

complete_hours <- rep(Hours, times = length(Dates))
str(complete_dates)
str(complete_hours)
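How `rep()` lines up dates and hours can be seen on a smaller example. This is a sketch with hypothetical names (`toy_dates`, `toy_hours`, `toy_index`, `full_dates`): `each` repeats every date once per hour, while `times` repeats the whole hour sequence once per date. The last lines check that the date range used above gives 366 days, so 366 × 24 = 8784 hourly rows, which matches the number of transactions reported in the rule summary at the end of this document.

```r
# Two toy dates and three toy hours instead of a full year of 24-hour days.
toy_dates <- seq(as.Date("2019-01-01"), as.Date("2019-01-02"), by = "day")
toy_hours <- 1:3

# each = ...  : 01-01, 01-01, 01-01, 01-02, 01-02, 01-02
date_col <- rep(toy_dates, each = length(toy_hours))

# times = ... : 1, 2, 3, 1, 2, 3
hour_col <- rep(toy_hours, times = length(toy_dates))

toy_index <- data.frame(Dates = date_col, Hours = hour_col)
print(toy_index)

# Row count implied by the full date range used in the analysis.
full_dates <- seq(as.Date("2019-01-01"), as.Date("2020-01-01"), by = "day")
print(length(full_dates) * 24)
```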


Averages_df <- data.frame(
  Dates = complete_dates,
  # Named Hour so that the later Averages_df$Hour references match the column exactly
  Hour = complete_hours,
  Washing_Machine = averagesobservations1Washing_Machine,
  Dish_Washer = averagesobservations1DishWasher,
  Dryer = averagesobservations1Dryer,
  Refrigerator = averageobservations1Refrigerator,
  Lighting = averageobservations1Lighting,
  TV = averageobservations1TV,
  Microwave = averageobservations1Microwave,
  Water_Heater = averageobservations1WaterHeater,
  Kettle = averageobservations1Kettle,
  Total_Consumption = averagesobservationsTotal_Consumption
)


str(Averages_df)

Clustering is a statistical technique that groups the observations of a dataset into clusters based on their similarities. As a result, one can see which observations in a dataset are more similar to one another than to the other observations in the dataset.

For this particular analysis, clustering will be used to understand how much energy residential households consume and at which hours of the day they consume it.

More specifically, k-means will be utilized. The k-means method minimizes the sum of squared distances between the observations and their respective cluster centroids.

The value of k indicates how many clusters the algorithm will attempt to identify within the data.

For example, if k is set to 5, the algorithm will partition the data into five separate clusters based on similarity.

The most common similarity measure used in k-means is the Euclidean distance. This metric calculates the straight-line distance between a data point and a centroid, which is the center point of a cluster determined by the mean of all the data points assigned to that cluster.

The smaller the distance, the higher the similarity between the data point and the centroid.
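The relationship between observations, centroids, and the quantity that k-means minimizes can be checked on a small made-up example. This is a sketch under stated assumptions: the values in `toy_kwh` are invented, and `set.seed()` and `nstart` are added here for repeatability; they are not part of the analysis code in this document.

```r
set.seed(42)  # k-means starts from random centers; fixing the seed makes the run repeatable

# Toy hourly consumption values in kWh (made up): a low group and a high group.
toy_kwh <- c(0.5, 0.6, 0.7, 0.55, 3.0, 3.2, 2.9, 3.1)

# nstart = 10 repeats the random initialisation and keeps the best solution.
toy_fit <- kmeans(toy_kwh, centers = 2, nstart = 10)

# Each centroid equals the mean of the observations assigned to that cluster.
member_means <- tapply(toy_kwh, toy_fit$cluster, mean)
print(toy_fit$centers)
print(member_means)

# Total within-cluster sum of squares, the quantity k-means tries to minimise.
print(toy_fit$tot.withinss)
```

Because the starting centers are random, cluster labels can differ between runs unless a seed is set, which is worth keeping in mind when comparing cluster numbers across plots.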

Identifying such clusters can help analysts distinguish patterns among similar observations that had not been apparent before.

For example, by clustering energy consumption data using the k-means method, one can identify groups of consumers that utilize certain appliances at certain hours of the day, while another cluster may capture the use of these appliances at a different hour of the day.

As a result, utility companies can understand when and where energy demand is highest or lowest so that they can better manage their resources, reduce peak loads, and improve overall grid efficiency. Initiatives such as time-based pricing can be implemented as a result.

To visualize these results, a k-means cluster graph will be produced for each appliance as well as for the total consumption of these appliances. These graphs show the clusters of hourly observations for each appliance against the hour of the day.

Further Documentation: https://www.rdocumentation.org/packages/grDevices/versions/3.6.2/topics/chull

Further Documentation: https://ggplot2.tidyverse.org/reference/aes.html

Further Documentation: https://ggplot2.tidyverse.org/
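The `chull()` function referenced above is what produces the shaded outline around each cluster in the graphs that follow. A minimal sketch on made-up points shows what it returns; the names `toy_x`, `toy_y`, `hull_idx`, and `hull_points` are hypothetical.

```r
# Toy points (made up): four corners of a square plus one interior point.
toy_x <- c(0, 1, 1, 0, 0.5)
toy_y <- c(0, 0, 1, 1, 0.5)

# chull() returns the indices of the points that lie on the hull boundary.
hull_idx <- grDevices::chull(toy_x, toy_y)
print(hull_idx)

# Keeping only those rows gives the polygon vertices; the interior point is dropped.
hull_points <- data.frame(x = toy_x, y = toy_y)[hull_idx, ]
print(hull_points)
```

In the cluster graphs, the same idea is applied within each cluster, so every cluster gets its own polygon.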

Total Consumption:

k <- 5

kmeans_result_Total_Consumption <- kmeans(Averages_df$Total_Consumption, centers = k)


length(kmeans_result_Total_Consumption$cluster)


Averages_df$Cluster <- as.factor(kmeans_result_Total_Consumption$cluster)


cluster_df_Total_Consumption <- data.frame(Hour = Averages_df$Hour,
  Total_Consumption = Averages_df$Total_Consumption,
  Cluster = Averages_df$Cluster)


centroids <- cluster_df_Total_Consumption %>%
  group_by(Cluster) %>%
  summarise(Hour = mean(Hour), Total_Consumption = mean(Total_Consumption))


Total_Consumption_convex_hulls <- cluster_df_Total_Consumption %>%
  group_by(Cluster) %>%
  slice(chull(Hour, Total_Consumption))

ggplot(cluster_df_Total_Consumption, aes(x = Hour, y = Total_Consumption, color = Cluster)) +
  geom_point(alpha = 0.7) +  
  geom_point(data = centroids, aes(x = Hour, y = Total_Consumption), 
             color = "black", size = 4, shape = 3) +  
  geom_polygon(data = Total_Consumption_convex_hulls, aes(fill = Cluster), alpha = 0.2, show.legend = TRUE) +  
  labs(title = "Cluster Visualization of Total Consumption by kWh",
       x = "Hour",
       y = "Total Consumption by kWh",
       color = "Cluster") +
  theme_minimal() +  
  scale_color_brewer(palette = "Spectral")

ggsave("Cluster_Total_Consumption.png", width = 6, height = 4)

In the Total Consumption k-means cluster graph, the evening shows the highest energy consumption, with values between 2.5 and 5.0 kWh from 18:00 to 21:00.

This is most likely because household members arrive home from work and begin using their appliances after being away all day.

Dishwasher:

kmeans_result_Dish_Washer <- kmeans(Averages_df$Dish_Washer, centers = k)


Averages_df$Cluster <- as.factor(kmeans_result_Dish_Washer$cluster)


cluster_df_Dish_Washer <- data.frame(Hour = Averages_df$Hour,
  Dish_Washer = Averages_df$Dish_Washer,
  Cluster = Averages_df$Cluster)


centroids_Dish_Washer <- cluster_df_Dish_Washer %>%
  group_by(Cluster) %>%
  summarise(Hour = mean(Hour), Dish_Washer = mean(Dish_Washer))


Dishwasher_convex_hulls <- cluster_df_Dish_Washer  %>%
  group_by(Cluster) %>%
  slice(chull(Hour, Dish_Washer))

ggplot(cluster_df_Dish_Washer, aes(x = Hour, y = Dish_Washer, color = Cluster)) +
  geom_point(alpha = 0.7) +   
  geom_point(data = centroids_Dish_Washer, aes(x = Hour, y = Dish_Washer), 
             color = "black", size = 4, shape = 3) +   
  geom_polygon(data = Dishwasher_convex_hulls, aes(fill = Cluster), alpha = 0.2, show.legend = TRUE) +  
  labs(title = "Cluster Visualization of Dish Washer Consumption by kWh",
       x = "Hour",
       y = "Dishwasher Consumption by kWh",
       color = "Cluster") +
  theme_minimal() +  
  scale_color_brewer(palette = "Spectral")

ggsave("Cluster_Dish_Washer.png", width = 6, height = 4)

Washing Machine:

kmeans_result_Washing_Machine <- kmeans(Averages_df$Washing_Machine, centers = k)

Averages_df$Cluster <- as.factor(kmeans_result_Washing_Machine$cluster)

cluster_df_Washing_Machine <- data.frame(Hour = Averages_df$Hour,
                                         Washing_Machine = Averages_df$Washing_Machine,
                                         Cluster = Averages_df$Cluster)


centroids_Washing_Machine <- cluster_df_Washing_Machine %>%
  group_by(Cluster) %>%
  summarise(Hour = mean(Hour), Washing_Machine = mean(Washing_Machine))

Washing_Machine_convex_hulls <- cluster_df_Washing_Machine  %>%
  group_by(Cluster) %>%
  slice(chull(Hour, Washing_Machine))




ggplot(cluster_df_Washing_Machine, aes(x = Hour, y = Washing_Machine, color = Cluster)) +
  geom_point(alpha = 0.7) +  
  geom_point(data = centroids_Washing_Machine, aes(x = Hour, y = Washing_Machine), 
             color = "black", size = 4, shape = 3) +  
  geom_polygon(data = Washing_Machine_convex_hulls, aes(fill = Cluster), alpha = 0.2, show.legend = TRUE) +
  labs(title = "Cluster visualization of Washing Machine Consumption by kWh",
       x = "Hour",
       y = "Washing Machine Consumption by kWh",
       color = "Cluster") +
  theme_minimal() +  
  scale_color_brewer(palette = "Spectral")

ggsave("Cluster_Washing_Machine.png", width = 6, height = 4)

Water Heater:

kmeans_result_Water_Heater <- kmeans(Averages_df$Water_Heater, centers = k)

Averages_df$Cluster <- as.factor(kmeans_result_Water_Heater$cluster)

cluster_df_Water_Heater <- data.frame(Hour = Averages_df$Hour,
  Water_Heater = Averages_df$Water_Heater,
  Cluster = Averages_df$Cluster)


centroids_Water_Heater <- cluster_df_Water_Heater %>%
  group_by(Cluster) %>%
  summarise(Hour = mean(Hour), Water_Heater = mean(Water_Heater))

Water_Heater_convex_hulls <- cluster_df_Water_Heater  %>%
  group_by(Cluster) %>%
  slice(chull(Hour, Water_Heater))

ggplot(cluster_df_Water_Heater, aes(x = Hour, y = Water_Heater, color = Cluster)) +
  geom_point(alpha = 0.7) +  
  geom_point(data = centroids_Water_Heater, aes(x = Hour, y = Water_Heater), 
             color = "black", size = 4, shape = 3) +
  geom_polygon(data = Water_Heater_convex_hulls, aes(fill = Cluster), alpha = 0.2, show.legend = TRUE) +  
  labs(title = "Cluster visualization of Water Heater Consumption by kWh",
       x = "Hour",
       y = "Water Heater Consumption by kWh",
       color = "Cluster") +
  theme_minimal() +  
  scale_color_brewer(palette = "Spectral")

ggsave("Cluster_Water_Heater.png", width = 6, height = 4)

TV:

kmeans_result_TV <- kmeans(Averages_df$TV, centers = k)

Averages_df$Cluster <- as.factor(kmeans_result_TV$cluster)

cluster_df_TV <- data.frame(Hour = Averages_df$Hour,
  TV = Averages_df$TV,
  Cluster = Averages_df$Cluster)

centroids_TV <- cluster_df_TV %>%
  group_by(Cluster) %>%
  summarise(Hour = mean(Hour), TV = mean(TV))

TV_convex_hulls <- cluster_df_TV  %>%
  group_by(Cluster) %>%
  slice(chull(Hour, TV))

ggplot(cluster_df_TV, aes(x = Hour, y = TV, color = Cluster)) +
  geom_point(alpha = 0.7) +  
  geom_point(data = centroids_TV, aes(x = Hour, y = TV), 
             color = "black", size = 4, shape = 3) +
  geom_polygon(data = TV_convex_hulls, aes(fill = Cluster), alpha = 0.2, show.legend = TRUE) + 
  labs(title = "Cluster visualization of TV Consumption by kWh",
       x = "Hour",
       y = "TV Consumption by kWh",
       color = "Cluster") +
  theme_minimal() +  
  scale_color_brewer(palette = "Spectral")

ggsave("Cluster_TV.png", width = 6, height = 4)

Kettle:

kmeans_result_Kettle <- kmeans(Averages_df$Kettle, centers = k)

Averages_df$Cluster <- as.factor(kmeans_result_Kettle$cluster)

cluster_df_Kettle <- data.frame(Hour = Averages_df$Hour,
  Kettle = Averages_df$Kettle,
  Cluster = Averages_df$Cluster)

centroids_Kettle <- cluster_df_Kettle %>%
  group_by(Cluster) %>%
  summarise(Hour = mean(Hour), Kettle = mean(Kettle))

Kettle_convex_hulls <- cluster_df_Kettle  %>%
  group_by(Cluster) %>%
  slice(chull(Hour, Kettle))

ggplot(cluster_df_Kettle, aes(x = Hour, y = Kettle, color = Cluster)) +
  geom_point(alpha = 0.7) +  
  geom_point(data = centroids_Kettle, aes(x = Hour, y = Kettle), 
             color = "black", size = 4, shape = 3) +
  geom_polygon(data = Kettle_convex_hulls, aes(fill = Cluster), alpha = 0.2, show.legend = TRUE) + 
  labs(title = "Cluster visualization of Kettle Consumption by kWh",
       x = "Hour",
       y = "Kettle Consumption by kWh",
       color = "Cluster") +
  theme_minimal() +  
  scale_color_brewer(palette = "Spectral")

ggsave("Cluster_Kettle.png", width = 6, height = 4)

Microwave:

kmeans_result_Microwave <- kmeans(Averages_df$Microwave, centers = k)

Averages_df$Cluster <- as.factor(kmeans_result_Microwave$cluster)

cluster_df_Microwave <- data.frame(Hour = Averages_df$Hour,
   Microwave = Averages_df$Microwave,
   Cluster = Averages_df$Cluster)

centroids_Microwave <- cluster_df_Microwave %>%
  group_by(Cluster) %>%
  summarise(Hour = mean(Hour), Microwave = mean(Microwave))

Microwave_convex_hulls <- cluster_df_Microwave  %>%
  group_by(Cluster) %>%
  slice(chull(Hour, Microwave))

ggplot(cluster_df_Microwave, aes(x = Hour, y = Microwave, color = Cluster)) +
  geom_point(alpha = 0.7) +  
  geom_point(data = centroids_Microwave, aes(x = Hour, y = Microwave), 
             color = "black", size = 4, shape = 3) +  
  geom_polygon(data = Microwave_convex_hulls, aes(fill = Cluster), alpha = 0.2, show.legend = TRUE) +
  labs(title = "Cluster visualization of Microwave Consumption by kWh",
       x = "Hour",
       y = "Microwave Consumption by kWh",
       color = "Cluster") +
  theme_minimal() +  
  scale_color_brewer(palette = "Spectral")

ggsave("Cluster_Microwave.png", width = 6, height = 4)

Lighting:

kmeans_result_Lighting <- kmeans(Averages_df$Lighting, centers = k)

Averages_df$Cluster <- as.factor(kmeans_result_Lighting$cluster)

cluster_df_Lighting <- data.frame(Hour = Averages_df$Hour,
  Lighting = Averages_df$Lighting,
  Cluster = Averages_df$Cluster)

centroids_Lighting <- cluster_df_Lighting %>%
  group_by(Cluster) %>%
  summarise(Hour = mean(Hour), Lighting = mean(Lighting))

Lighting_convex_hulls <- cluster_df_Lighting  %>%
  group_by(Cluster) %>%
  slice(chull(Hour, Lighting))

ggplot(cluster_df_Lighting, aes(x = Hour, y = Lighting, color = Cluster)) +
  geom_point(alpha = 0.7) +  
  geom_point(data = centroids_Lighting, aes(x = Hour, y = Lighting), 
             color = "black", size = 4, shape = 3) +  
  geom_polygon(data = Lighting_convex_hulls, aes(fill = Cluster), alpha = 0.2, show.legend = TRUE) +
  labs(title = "Cluster visualization of Lighting Consumption by kWh",
       x = "Hour",
       y = "Lighting Consumption by kWh",
       color = "Cluster") +
  theme_minimal() + 
  scale_color_brewer(palette = "Spectral")

ggsave("Cluster_Lighting.png", width = 6, height = 4)

Dryer:

kmeans_result_Dryer <- kmeans(Averages_df$Dryer, centers = k)

length(kmeans_result_Dryer$cluster) 

Averages_df$Cluster <- as.factor(kmeans_result_Dryer$cluster)  

cluster_df_Dryer <- data.frame(Hour = Averages_df$Hour,
  Dryer = Averages_df$Dryer,
  Cluster = Averages_df$Cluster)

centroids_Dryer <- cluster_df_Dryer %>%
  group_by(Cluster) %>%
  summarise(Hour = mean(Hour), Dryer = mean(Dryer))

Dryer_convex_hulls <- cluster_df_Dryer %>%
  group_by(Cluster) %>%
  slice(chull(Hour, Dryer))

ggplot(cluster_df_Dryer, aes(x = Hour, y = Dryer, color = Cluster)) +
  geom_point(alpha = 0.7) +  
  geom_point(data = centroids_Dryer, aes(x = Hour, y = Dryer), 
             color = "black", size = 4, shape = 3) + 
  geom_polygon(data = Dryer_convex_hulls, aes(fill = Cluster), alpha = 0.2, show.legend = TRUE) +  
  labs(title = "Cluster visualization of Dryer Consumption by kWh",
       x = "Hour",
       y = "Dryer Consumption by kWh",
       color = "Cluster") +
  theme_minimal() + 
  scale_color_brewer(palette = "Spectral")

ggsave("Cluster_Dryer.png", width = 6, height = 4)

Refrigerator:

kmeans_result_Refrigerator <- kmeans(Averages_df$Refrigerator, centers = k)

Averages_df$Cluster <- as.factor(kmeans_result_Refrigerator$cluster)

cluster_df_Refrigerator <- data.frame(Hour = Averages_df$Hour,
  Refrigerator = Averages_df$Refrigerator,
  Cluster = Averages_df$Cluster)

centroids_Refrigerator <- cluster_df_Refrigerator %>%
  group_by(Cluster) %>%
  summarise(Hour = mean(Hour), Refrigerator = mean(Refrigerator))

Refrigerator_convex_hulls <- cluster_df_Refrigerator %>%
  group_by(Cluster) %>%
  slice(chull(Hour, Refrigerator))

ggplot(cluster_df_Refrigerator, aes(x = Hour, y = Refrigerator, color = Cluster)) +
  geom_point(alpha = 0.7) +  
  geom_point(data = centroids_Refrigerator, aes(x = Hour, y = Refrigerator), 
             color = "black", size = 4, shape = 3) + 
  geom_polygon(data = Refrigerator_convex_hulls, aes(fill = Cluster), alpha = 0.2, show.legend = TRUE) +  
  labs(title = "Cluster visualization of Refrigerator Consumption by kWh",
       x = "Hour",
       y = "Refrigerator Consumption by kWh",
       color = "Cluster") +
  theme_minimal() + 
  scale_color_brewer(palette = "Spectral")

ggsave("Cluster_Refrigerator.png", width = 6, height = 4)

Dimensionality reduction is a technique used in data analysis in which the number of variables in a dataset is reduced while retaining as much information as possible.

In essence, dimensionality reduction techniques condense the data into fewer dimensions, making it easier to interpret and work with.

To apply dimensionality reduction to the dataset, a principal component analysis (PCA) will be conducted and visualized with a cumulative variance graph, a PCA scatter plot, and a biplot.

Further Documentation: https://www.rdocumentation.org/packages/memisc/versions/0.99.31.8/topics/Sapply

Further Documentation: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/scale

Further Documentation: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/prcomp

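# Averages_df$Cluster holds the labels from the most recent kmeans() call above
# (the Refrigerator run when the sections are executed in order).
ClusterAverage = Averages_df$Cluster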

Average_df_notime_Clusters = data.frame(
  Washing_Machine = averagesobservations1Washing_Machine,
  Dish_Washer = averagesobservations1DishWasher,
  Dryer = averagesobservations1Dryer,
  Refrigerator = averageobservations1Refrigerator,
  Lighting = averageobservations1Lighting,
  TV = averageobservations1TV,
  Microwave = averageobservations1Microwave,
  Water_Heater = averageobservations1WaterHeater,
  Kettle = averageobservations1Kettle,
  Cluster = ClusterAverage
)

Average_df_notime = data.frame(
  Washing_Machine = averagesobservations1Washing_Machine,
  Dish_Washer = averagesobservations1DishWasher,
  Dryer = averagesobservations1Dryer,
  Refrigerator = averageobservations1Refrigerator,
  Lighting = averageobservations1Lighting,
  TV = averageobservations1TV,
  Microwave = averageobservations1Microwave,
  Water_Heater = averageobservations1WaterHeater,
  Kettle = averageobservations1Kettle)

Further Documentation: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/cumsum

Further Documentation: https://stackoverflow.com/questions/23866765/getting-cumulative-proportion-in-pca

Cumulative variance indicates the total proportion of variance explained by the first principal components taken together.

This measure helps determine how many components are necessary to adequately capture the data’s variability.

Within the graph, the number of principal components is shown along the horizontal axis, while the cumulative variance is displayed along the vertical axis. The higher the cumulative variance, the more of the variation in appliance consumption the retained components capture. The dashed red line marks the 95 percent threshold, so the components needed to reach it account for all but 5 percent of the variance.
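The cumulative variance calculation and the 95 percent threshold can be illustrated on made-up data. This is only a sketch: the data frame `toy` is simulated, and the names `explained`, `cumulative`, and `n_needed` are hypothetical. Three of the four toy columns share one underlying signal, so most of the variance should be captured by very few components.

```r
set.seed(1)

# Toy data (made up): three strongly correlated columns and one independent column.
n <- 200
base_signal <- rnorm(n)
toy <- data.frame(
  a = base_signal + rnorm(n, sd = 0.1),
  b = base_signal + rnorm(n, sd = 0.1),
  c = base_signal + rnorm(n, sd = 0.1),
  d = rnorm(n)
)

toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of variance per component, then the running total.
explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
cumulative <- cumsum(explained)
print(round(cumulative, 3))

# Smallest number of components whose cumulative variance reaches 95 percent.
n_needed <- which(cumulative >= 0.95)[1]
print(n_needed)
```

The same `which(... >= 0.95)[1]` step could be applied to `cumulative_variance` from the analysis to read the threshold off numerically rather than from the graph.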

pca_Energy_Dataset <- Average_df_notime[, sapply(Average_df_notime, is.numeric)]  

pca_Energy_Dataset_scaled <- scale(pca_Energy_Dataset)


pca_result <- prcomp(pca_Energy_Dataset_scaled, center = TRUE, scale. = TRUE)


summary(pca_result)

print(pca_result)

num_components <- 2


reduced_data <- pca_result$x[, 1:num_components]

reduced_data_df <- as.data.frame(reduced_data)




cumulative_variance <- cumsum(pca_result$sdev^2) / sum(pca_result$sdev^2)
cumulative_variance_df <- data.frame(
  Components = seq_along(cumulative_variance),
  CumulativeVariance = cumulative_variance
)


  ggplot(cumulative_variance_df, aes(x = Components, y = CumulativeVariance)) +
  geom_line() +
  geom_point() +
  labs(x = "Number of Components", 
       y = "Cumulative Variance", 
       title = "Cumulative Variance Explained by PCA Components") +
  geom_hline(yintercept = 0.95, linetype = "dashed", color = "red") +
  theme_minimal()

ggsave("Newcumvar.png", width = 6, height = 4)

PCA, or Principal Component Analysis, is a statistical technique that transforms a set of possibly correlated variables into new variables, called principal components, ordered by how much of the variance in the observations each one explains.

Each principal component is a linear combination of the original variables, and the components are uncorrelated with one another, so they show the patterns in the data without correlated variables repeating the same information.
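Two properties of principal components can be verified directly on simulated data: the component scores are the standardized data multiplied by the loading matrix, and different components are uncorrelated. The sketch below uses made-up columns whose names (`kettle_like`, `microwave_like`, `fridge_like`) are hypothetical and only hint at the appliances in this analysis.

```r
set.seed(2)
n <- 100
shared <- rnorm(n)

# Toy data (made up): two columns share a common signal, one is independent.
toy <- data.frame(
  kettle_like    = shared + rnorm(n, sd = 0.5),
  microwave_like = shared + rnorm(n, sd = 0.5),
  fridge_like    = rnorm(n)
)

toy_scaled <- scale(toy)
toy_pca <- prcomp(toy_scaled)

# Scores are the standardized data multiplied by the loading (rotation) matrix.
manual_scores <- toy_scaled %*% toy_pca$rotation
print(max(abs(manual_scores - toy_pca$x)))

# Correlations between different components are numerically zero.
print(round(cor(toy_pca$x), 6))
```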

When comparing the first two principal components (PC1 and PC2), one can observe how the different clusters of observations relate to one another.

The position of each dot shows the scores of that observation on the two principal components.

Observations that are closer together are more similar to each other based on the attributes that were included in the PCA.

The closer the observations within a group are to each other, the more consistent the pattern indicated by the PCA is.

Such relationships support arguments for energy management strategies among appliances within households, especially when the usage of some appliances depends on that of others.

loadings_df <- as.data.frame(pca_result$rotation)
loadings_df$Variable <- rownames(loadings_df) 

ggplot(reduced_data_df, aes(x = PC1, y = PC2, color = Average_df_notime_Clusters$Cluster)) +
  geom_point(size = 1, alpha = 0.7) +  
  labs(title = "K-means Clustering on PCA Results", 
       x = "PC1", 
       y = "PC2", 
       color = "Cluster") + 
  theme_minimal() +
  theme(legend.position = "right") +
  scale_color_brewer(palette = "Spectral")

ggsave("PCA.png", width = 6, height = 4)

In the PCA Scatter Plot, the distribution of clusters is heterogeneous rather than homogeneous. The dots are not evenly spread across the graph, indicating that these clusters have distinct and varying energy consumption behaviors that are different from other cluster groups.

To understand how strongly each original variable contributes to the principal components, a biplot can be used.

A biplot combines both the PCA graph and the variables in one graph, allowing for interpretation of how strongly each variable contributes to the principal components through the directions and lengths of the vectors.

Longer vectors indicate a stronger relationship between the variables represented by these vectors and the principal components derived from the dataset.

By interpreting the vectors, it becomes possible to understand which appliances contribute most strongly to the variation in energy consumption.

As a result, decisions could be made regarding energy management to optimize energy usage during certain hours of the day.
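The length of each biplot vector can also be computed rather than judged by eye. As a sketch on simulated data, the length of a variable's arrow in the PC1 and PC2 plane is the Euclidean norm of its two loadings. The column names and the object `vector_length` are hypothetical and the values are made up.

```r
set.seed(3)
n <- 150
shared <- rnorm(n)

# Toy appliance-like columns (made up): two share a common signal, two are independent noise.
toy <- data.frame(
  kettle_like    = shared + rnorm(n, sd = 0.3),
  microwave_like = shared + rnorm(n, sd = 0.3),
  fridge_like    = rnorm(n),
  dryer_like     = rnorm(n)
)

toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Loadings of each variable on PC1 and PC2, i.e. the arrow end points in a biplot.
loadings_2d <- toy_pca$rotation[, 1:2]

# Arrow length: Euclidean norm of each variable's (PC1, PC2) loading pair.
vector_length <- sqrt(rowSums(loadings_2d^2))
print(sort(vector_length, decreasing = TRUE))
```

Since the full loading matrix is orthonormal, no arrow length computed this way can exceed 1; values close to 1 mean the variable is almost entirely represented by the first two components.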

ggplot(reduced_data_df, aes(x = PC1, y = PC2)) +
  geom_point(color = "blue") +  
  geom_segment(data = loadings_df, aes(xend = PC1, yend = PC2), 
               x = 0, y = 0, 
               arrow = arrow(type = "open", length = unit(1, "inches")), 
               color = "red") +   
  geom_text_repel(data = loadings_df, aes(label = Variable), 
                  color = "red", size = 5, 
                  max.overlaps = Inf) +   
  labs(title = "PCA Biplot", x = "PC1", y = "PC2") +
  theme_minimal()

ggsave("Biplot.png", width = 6, height = 4)

In the biplot, appliances such as kettles and microwaves exhibit high variability, reflecting that they are only used for short periods at a time, while refrigerators show less variation because they operate constantly to maintain the temperature of household goods.

Association rules are a method used in data analysis to find interesting relationships between different variables in a dataset. They help identify patterns that show how items are related to each other based on their occurrences.

Association rules are essentially “If-Then” statements, indicating that when certain conditions are present, certain other conditions are likely to be present as well.

The “If” condition is known as the antecedent, and the “Then” condition that follows from it is the consequent. The validity of such rules is measured using metrics such as support, confidence, and lift.

Support indicates how frequently the conditions appear together in the dataset, confidence measures the likelihood of the consequent after the antecedent, and lift assesses the strength of the association rule relative to the expected occurrence of the consequent.
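The three metrics can be computed by hand on a small made-up table, which makes their definitions concrete. In this sketch each row of `toy_tx` is one hour and `TRUE` means high usage of that appliance; the data and all variable names are hypothetical. The rule being measured is "if kettle, then microwave".

```r
# Toy transactions (made up): one row per hour, TRUE means high usage in that hour.
toy_tx <- data.frame(
  kettle    = c(TRUE, TRUE, TRUE,  FALSE, FALSE, TRUE, FALSE, FALSE),
  microwave = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE,  FALSE)
)

# Support: share of hours in which antecedent and consequent occur together.
support_both <- mean(toy_tx$kettle & toy_tx$microwave)

# Confidence: among hours with the antecedent, the share that also has the consequent.
support_lhs <- mean(toy_tx$kettle)
confidence  <- support_both / support_lhs

# Lift: confidence relative to how often the consequent occurs anyway.
support_rhs <- mean(toy_tx$microwave)
lift        <- confidence / support_rhs

print(c(support = support_both, confidence = confidence, lift = lift))
```

A lift above 1 means the consequent occurs more often with the antecedent than it does overall, while a lift of exactly 1 means the two are independent.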

For example, an association rule could state that if a household demands higher than average levels of energy from 6 PM to 7 PM, then it is likely to also demand higher than average levels of energy from 7 PM to 8 PM. This type of rule is known as a sequential association rule.

A sequential association rule identifies relationships between events or occurrences over a period of time. In essence, it captures the likelihood that the occurrence of one event will be followed by the occurrence of another.
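The hour-to-hour example above can be sketched in a few lines on made-up data. The vector `toy_hourly_kwh` and the names `above_avg`, `antecedent`, and `consequent` are hypothetical; the idea is to pair each hour with the hour that follows it and measure how often above-average demand carries over.

```r
# Toy hourly totals in kWh for ten consecutive hours (made up).
toy_hourly_kwh <- c(0.4, 0.5, 0.6, 1.8, 2.0, 1.9, 0.7, 0.5, 1.7, 1.6)

# Flag the hours with higher than average demand.
above_avg <- toy_hourly_kwh > mean(toy_hourly_kwh)

# Pair each hour h (antecedent) with the following hour h + 1 (consequent).
antecedent <- above_avg[-length(above_avg)]
consequent <- above_avg[-1]

# Support: share of consecutive hour pairs in which both hours are above average.
seq_support <- mean(antecedent & consequent)

# Confidence: of the above-average hours, the share followed by another above-average hour.
seq_confidence <- sum(antecedent & consequent) / sum(antecedent)

print(c(support = seq_support, confidence = seq_confidence))
```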

In order to understand how these association rules apply to the appliances, the Apriori algorithm will be used to analyze consumption patterns and relationships between different energy appliances.

In addition, a parallel coordinates plot, a scatter plot for association rules, and a matrix plot for association rules will be utilized to visualize the association rules for the appliances in this dataset.

Further Documentation: https://www.rdocumentation.org/packages/arulesViz/versions/1.1-0/topics/plot



rules <- apriori(Average_df_notime, parameter = list(supp = 0.2, conf = 0.6))
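Apriori works on categorical items, whereas `Average_df_notime` contains numeric kWh values, so the values have to be turned into categories before rules can be mined. How `apriori()` handles numeric columns depends on the arules version, and the exact discretization applied here is not stated in this document; the median support of 0.3333 in the summary further below would be consistent with three equally sized bins per appliance, but that is an assumption. The sketch below shows such an equal-frequency split in base R on made-up values, with hypothetical names `toy_kwh`, `breaks`, and `toy_level`.

```r
set.seed(4)

# Toy hourly kWh values (made up).
toy_kwh <- runif(90, min = 0, max = 3)

# Cut points at the 1/3 and 2/3 quantiles give three bins with equal counts.
breaks <- quantile(toy_kwh, probs = c(0, 1/3, 2/3, 1))

# Each value becomes a category, which is the form of input that rule mining needs.
toy_level <- cut(toy_kwh, breaks = breaks,
                 labels = c("low", "medium", "high"),
                 include.lowest = TRUE)

print(table(toy_level))
```

Making the discretization explicit before calling `apriori()` would make the meaning of each item in the resulting rules (for example, "Kettle is high") easier to interpret.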

In the parallel coordinates plot, the support values from the Apriori method indicate how frequently certain combinations of appliances are used together during certain hours of the day.

The longer the lines in the parallel coordinates plot, the stronger the association between the appliances and the higher the frequency of usage of such appliances.

The Right-Hand Side (RHS) represents the outcome of a usage rule and the Left-Hand Side (LHS) indicates the conditions leading to that particular outcome.

By analyzing these relationships among the appliances, utility companies can identify peak usage patterns and focus their attention on optimizing energy consumption among households for such combinations.

Further Documentation: https://cran.r-project.org/web/packages/arules/arules.pdf

plot(rules, method = "paracoord", control = list(reorder = TRUE, alpha = 0.7, col = rainbow(10)))

Please click on this link for the visualization of this graph:

https://imgur.com/7jnjmTW

Appliances such as microwaves and kettles had the longest lines in this parallel coordinates plot, indicating that they co-occur: they are more likely to be used together than other appliances.

This is most likely because both appliances are used during meal preparation.

summary(rules)
  
 
       
set of 940 rules

rule length distribution (lhs + rhs):sizes
1   2   3   4   5   6   7 
7  91 263 322 193  57   7 

 Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
1.000   3.000   4.000   3.853   5.000   7.000 

summary of quality measures:
   support          confidence        coverage            lift            count
Min.   :0.2000   Min.   :0.6141   Min.   :0.2000   Min.   :0.9211   Min.   :1757  
1st Qu.:0.2196   1st Qu.:0.7987   1st Qu.:0.2462   1st Qu.:1.0000   1st Qu.:1929  
Median :0.3333   Median :0.9177   Median :0.3333   Median :1.1488   Median :2928  
Mean   :0.3529   Mean   :0.8900   Mean   :0.4033   Mean   :1.1688   Mean   :3100  
3rd Qu.:0.4647   3rd Qu.:1.0000   3rd Qu.:0.5162   3rd Qu.:1.3071   3rd Qu.:4082  
Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :2.3248   Max.   :8784  

mining info:
                data ntransactions support confidence
Average_df_notime          8784     0.2        0.6
                                                                    call
apriori(data = Average_df_notime, parameter = list(supp = 0.2, conf = 0.6))

The quality measures of these rules provide insights into their significance.

Support, which represents the proportion of transactions where a rule applies, ranges from 0.2 to 1.0.

This means that the least frequent rules appear in at least 20% of the dataset, which follows from the minimum support of 0.2 set in the Apriori call. The median support value of 0.3333 indicates that half of the rules hold true in at least 33.33% of all observations from the dataset.
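
The support and count columns of the summary are linked by the number of transactions. As a quick check that uses only figures from the summary output (8784 transactions, a support threshold of 0.2, and a median count of 2928), the base R lines below reproduce the reported minimum count and median support. The variable names are illustrative.

```r
n_transactions <- 8784   # from the mining info in the summary output
min_support    <- 0.2    # support threshold passed to apriori()

# Smallest whole number of transactions that satisfies the support threshold
min_count <- ceiling(min_support * n_transactions)

# Median support recovered from the median count reported in the summary
median_support <- 2928 / n_transactions

print(min_count)                  # 1757, matching the reported minimum count
print(round(median_support, 4))   # 0.3333, matching the reported median support
```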

Confidence, which measures how often the rule is correct when its left-hand side (LHS) conditions are met, has a median value of 0.9177, meaning that for half of the rules the consequent occurs in at least 91.77% of the observations in which the antecedent conditions are present.

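Lift, which indicates the strength of the relationship between items, ranges from 0.9211 to 2.3248, with a median lift of 1.1488. Since a lift of 1 corresponds to independence, this indicates that most of these rules describe a positive association between items, although the first quartile of 1.0000 shows that at least a quarter of the rules have a lift at or below 1.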

The scatter plot for association rules uses the support and confidence values derived from the Apriori algorithm to interpret the relationships between appliances.

Each point on the scatter plot represents a rule, reflecting instances in which certain appliances are used together at certain times; points with high confidence are of particular interest.

The support indicates the overall frequency of usage for the combination of these appliances during specific time frames, while confidence reflects the likelihood that both appliances are used together given that one is already in use.

As clusters form in the scatter plot, the points with high confidence indicate that certain association rules do correspond to appliances being used during certain hours of the day.

plot(rules, method = "scatter", measure = "support", shading = "confidence",
       main = "Scatter Plot of Association Rules")
   

# ggsave() comes from ggplot2 and saves the most recently displayed ggplot
ggsave("Scatterplot_Association.png", width = 6, height = 4)
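
Because ggsave() writes the most recent ggplot2 plot, it only captures the rule plot when the plot was drawn through ggplot2, which may depend on the arulesViz version and plotting engine in use. Where a plot is drawn with base graphics instead, a graphics device can be opened before plotting and closed afterwards. The sketch below illustrates that pattern with a hypothetical file path and toy points standing in for the rule plot.

```r
# Hypothetical output path in a temporary directory
out_file <- file.path(tempdir(), "example_rule_plot.png")

# Open a PNG device with the same 6 x 4 inch size used elsewhere in this analysis
png(out_file, width = 6, height = 4, units = "in", res = 150)

# Toy support/confidence points standing in for the association rule plot
plot(c(0.20, 0.35, 0.50), c(0.70, 0.90, 1.00),
     xlab = "support", ylab = "confidence",
     main = "Toy stand-in for a rule plot")

# Close the device so the file is written to disk
invisible(dev.off())
```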

The matrix plot for association rules also displays values derived from the Apriori algorithm, in this case support and lift, to examine relationships between appliances.

However, in the matrix plot, each cell represents a rule, that is, an instance where specific appliances are used together during certain times of the day.

The Left-Hand Side (LHS) of a rule consists of the appliances whose usage forms the condition of the rule. The Right-Hand Side (RHS) represents the appliance that is expected to be used when that condition holds.

To further clarify, a table displaying the association rules lists the antecedent appliances first, followed by the consequent and the lift value of the rule.

The lift value describes the correlation between the appliances within a particular rule from the table. A lift value greater than 1 implies a positive correlation, indicating that the consequent appliances are more likely to be used when the antecedent appliances are used.

By knowing these rules, energy management could become more efficient at targeting inefficiencies among these appliances, since the rules indicate which appliances are associated with each other.

plot(rules, method = "matrix", measure = c("support", "lift"),
       main = "Matrix Plot of Association Rules")
       

# ggsave() comes from ggplot2 and saves the most recently displayed ggplot
ggsave("Matrix.png", width = 6, height = 4)

In the matrix plot of association rules, the lift values indicate how much more likely an appliance is to be used when another appliance is already in use, compared with its usual frequency of use.

Lift values above 1 indicate positive correlation, and in this plot several rules have a lift value above 1.

This indicates that many of the appliances tend to be used together.