As consumers become more mindful of their energy consumption, especially from appliances, the demand for effective ways to reduce energy use has grown. As appliances become more advanced and diverse, usage patterns and optimization become important considerations for energy demand and consumption in residential households. With sustainability a widespread concern, households need the technology and knowledge to analyze their energy demand and consumption and to develop strategies for using appliances as efficiently as possible. R makes such investigations possible. R is a programming language and software environment for statistical computing, data analysis, and data visualization. In this analysis, clustering, association rules, and dimensionality reduction will be used to investigate energy demand and consumption among appliances within residential households.

The dataset used for this analysis is titled “Dataset of an energy community’s generation and consumption with appliance allocation”. Based on real data from the Greater London Area, it contains sample consumption and photovoltaic generation profiles for 50 residential households and a municipal library. Only the observations from the 50 residential households are considered in this analysis. The observations cover 10 appliances commonly used in households: air conditioning, kettles, televisions, microwaves, lights, refrigerators, dishwashers, washing machines, dryers, and water heaters. However, no air conditioning data was collected for the residential households, so air conditioning is excluded from this analysis. Consumption, measured in kilowatt hours (kWh), is the dependent variable, while time, recorded in 15-minute intervals, is the independent variable. Since the purpose of this analysis is to understand how appliance usage varies by hour of the day, temporal aggregation was used to combine the 15-minute observations into one-hour intervals: the four 15-minute values in each hour were added up and divided by 4 to obtain the average kWh value for that hour.
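As a minimal sketch of the temporal aggregation described above, the snippet below collapses eight 15-minute readings into two hourly averages. The vector `quarter_hourly` and its values are invented for illustration and are not taken from the dataset.

```r
# Hypothetical 15-minute readings covering two hours (values are made up)
quarter_hourly <- c(0.10, 0.20, 0.30, 0.40, 1.00, 1.00, 2.00, 0.00)

# Integer division assigns readings 1-4 to hour 0 and readings 5-8 to hour 1
hour_index <- (seq_along(quarter_hourly) - 1) %/% 4

# Taking the mean is the same as summing the four readings and dividing by 4
hourly_mean <- tapply(quarter_hourly, hour_index, mean)
print(hourly_mean)
```

Here the two hourly averages are 0.25 and 1.00. If each reading is the energy used during its 15-minute interval, the sum of the four readings would give the hour's total, while the mean used here gives the average per interval.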

          Further Documentation: https://www.sciencedirect.com/science/article/pii/S2352340922007971
      
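# Install any missing packages first, then load them
required_packages <- c("readxl", "dplyr", "arules", "arulesViz",
                       "factoextra", "ggplot2", "ggrepel")
missing_packages <- setdiff(required_packages, rownames(installed.packages()))
if (length(missing_packages) > 0) install.packages(missing_packages)

library(readxl)
library(dplyr)
library(arules)
library(arulesViz)
library(factoextra)
library(ggplot2)
library(ggrepel)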


Energy_Dataset <- "C:/Users/slapm/OneDrive/Desktop/Dataset.xlsx"

These lines of code combine the data of the 50 consumers and then average each observation across consumers for every 15-minute interval.

Combined_Energy_Dataset <- list()

          Further Documentation: https://www.dataquest.io/blog/for-loop-in-r/

for (i in 1:50) {

  Consumer <- paste0("Consumer", i)


  Combined_Energy_Dataset[[i]] <- read_excel(Energy_Dataset, sheet = Consumer)
}

          Further Documentation: https://dplyr.tidyverse.org/reference/bind_rows.html

Energy_Dataset <- bind_rows(Combined_Energy_Dataset)

          Further Documentation: https://dplyr.tidyverse.org/reference/group_by.html

          Further Documentation: https://dplyr.tidyverse.org/reference/across.html

          Further Documentation: https://dplyr.tidyverse.org/reference/summarise.html


Averages_by_period <- Energy_Dataset %>%
  group_by(...1) %>%   
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)), .groups = "drop")


print(Averages_by_period)
Averages_by_period <- Averages_by_period  %>%
  rename(Invervals = ...1)
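The per-interval averaging above can be cross-checked on a small scale with base R. This is only a sketch: the data frame `toy`, its column names, and its values are hypothetical, and `aggregate()` is used here as a stand-in for the `group_by()` and `summarise()` pipeline.

```r
# Toy data: two hypothetical consumers with two intervals each (values made up)
toy <- data.frame(
  Interval = c("00:00", "00:15", "00:00", "00:15"),
  Kettle   = c(0.0, 0.2, 0.4, 0.0),
  TV       = c(0.1, 0.1, 0.3, NA)
)

# Mean of every appliance column within each interval, ignoring missing values
toy_means <- aggregate(toy[, c("Kettle", "TV")],
                       by = list(Interval = toy$Interval),
                       FUN = mean, na.rm = TRUE)
print(toy_means)
```

Each interval ends up with one row holding the average across consumers, which is the shape `Averages_by_period` is expected to have.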






          Further Documentation: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/tapply

          Further Documentation: https://r-lang.com/seq_along-function-in-r/
   
  

Total Consumption:

observationsTotal_Consumption <- (Averages_by_period$"Total Consumption")

averagesobservationsTotal_Consumption <- tapply(observationsTotal_Consumption, (seq_along(observationsTotal_Consumption) - 1) %/% 4, mean)

print(averagesobservationsTotal_Consumption)

Washing Machine:

observations1Washing_Machine <- (Averages_by_period$"Washing Machine")

averagesobservations1Washing_Machine<- tapply(observations1Washing_Machine, (seq_along(observations1Washing_Machine) - 1) %/% 4, mean)

print(averagesobservations1Washing_Machine)

Dishwasher:

observations1DishWasher <- (Averages_by_period$"Dish washer")

averagesobservations1DishWasher <- tapply(observations1DishWasher, (seq_along(observations1DishWasher) - 1) %/% 4, mean)

print(averagesobservations1DishWasher)

Dryer:

observations1Dryer <- (Averages_by_period$"Dryer")

averagesobservations1Dryer <- tapply(observations1Dryer, (seq_along(observations1Dryer) - 1) %/% 4, mean)

print(averagesobservations1Dryer)

TV:

observations1TV <- (Averages_by_period$"TV")

averageobservations1TV <- tapply(observations1TV , (seq_along(observations1TV) - 1) %/% 4, mean)

print(averageobservations1TV)

Microwave:

observations1Microwave <- (Averages_by_period$"Microwave")

averageobservations1Microwave <- tapply(observations1Microwave, (seq_along(observations1Microwave) - 1) %/% 4, mean)

print(averageobservations1Microwave)

Kettle:

observations1Kettle <- (Averages_by_period$"Kettle")

averageobservations1Kettle <- tapply(observations1Kettle, (seq_along(observations1Kettle) - 1) %/% 4, mean)

print(averageobservations1Kettle)

Lighting:

observations1Lighting <- (Averages_by_period$"Lighting")

averageobservations1Lighting<- tapply(observations1Lighting, (seq_along(observations1Lighting) - 1) %/% 4, mean)

print(averageobservations1Lighting)

Refrigerator:

observations1Refrigerator <- (Averages_by_period$"Refrigerator")

averageobservations1Refrigerator<- tapply(observations1Refrigerator, (seq_along(observations1Refrigerator) - 1) %/% 4, mean)

print(averageobservations1Refrigerator)

Water Heater:

observations1WaterHeater <- (Averages_by_period$"Water heater")

averageobservations1WaterHeater <- tapply(observations1WaterHeater, (seq_along(observations1WaterHeater) - 1) %/% 4, mean)

print(averageobservations1WaterHeater)
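Since the same `tapply()` pattern is repeated for every appliance, it could also be wrapped in a small helper. The sketch below assumes a hypothetical function name `block_mean` and a toy stand-in for `Averages_by_period`; neither is part of the original analysis.

```r
# Hypothetical helper: average every block of `block_size` consecutive values
block_mean <- function(x, block_size = 4) {
  block_index <- (seq_along(x) - 1) %/% block_size
  as.vector(tapply(x, block_index, mean))
}

# Toy stand-in for Averages_by_period (column names and values are made up)
toy_period <- data.frame(
  Kettle = c(0, 0, 0.4, 0.4, 1, 1, 1, 1),
  TV     = c(0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.4, 0.4)
)

# Apply the helper to every column at once; the result has one row per hour
hourly <- as.data.frame(lapply(toy_period, block_mean))
print(hourly)
```

This produces the same hourly averages as the per-appliance blocks above while keeping all appliances in a single data frame.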

The dates and hours of the observations will also be included, matched to their corresponding hourly intervals.

          Further Documentation: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/rep

          Further Documentation: https://www.statology.org/r-add-days-to-date/

Dates <- seq(as.Date("2019-01-01"), as.Date("2020-01-01"), by = "day")

Hours <- 1:24

complete_dates <- rep(Dates, each = length(Hours))

complete_hours <- rep(Hours, times = length(Dates))
str(complete_dates)
str(complete_hours)
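The way `rep()` lines up dates and hours can be seen on a two-day toy calendar. The names `toy_dates`, `grid_dates`, and `grid_hours` are illustrative only.

```r
# Two-day toy calendar (dates chosen only for illustration)
toy_dates <- seq(as.Date("2019-01-01"), as.Date("2019-01-02"), by = "day")
toy_hours <- 1:24

# each = repeats every date 24 times in a row; times = cycles 1..24 once per date
grid_dates <- rep(toy_dates, each = length(toy_hours))
grid_hours <- rep(toy_hours, times = length(toy_dates))

# Both vectors have days * 24 elements, so they line up row by row
length(grid_dates)
length(grid_hours)

# seq() includes both endpoints, so 2019-01-01 to 2020-01-01 spans 366 days
n_days_full <- length(seq(as.Date("2019-01-01"), as.Date("2020-01-01"), by = "day"))
print(n_days_full)
```

Before building the combined data frame, it can be worth confirming that the number of hourly averages equals `length(Dates) * 24`, since `data.frame()` stops with an error when column lengths cannot be matched.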


Averages_df <- data.frame(
  Dates = complete_dates,
  Hours = complete_hours,
  Washing_Machine = averagesobservations1Washing_Machine,
  Dish_Washer = averagesobservations1DishWasher,
  Dryer = averagesobservations1Dryer,
  Refrigerator = averageobservations1Refrigerator,
  Lighting = averageobservations1Lighting,
  TV = averageobservations1TV,
  Microwave = averageobservations1Microwave,
  Water_Heater = averageobservations1WaterHeater,
  Kettle = averageobservations1Kettle,
  Total_Consumption = averagesobservationsTotal_Consumption
)


str(Averages_df)

Clustering is a statistical technique that groups values and observations from a dataset into clusters based on their similarities. As a result, one can understand how certain observations and values in a dataset are more in conjunction with one another than with other observations in the dataset.

For this particular analysis, clustering will be used to understand how much energy residential households consume and at which hours they consume it.

More specifically, k-means will be used. The k-means method minimizes the distance between the observations and their respective cluster centroids.

The value of k indicates how many clusters the algorithm will attempt to identify within the data.

For example, if k is set to 5, the algorithm will partition the data into five separate clusters based on similarity.

The most common similarity measure used in k-means is the Euclidean distance. This metric calculates the straight-line distance between a data point and a centroid, which is the center point of a cluster determined by the mean of all the data points assigned to that cluster.

The smaller the distance, the higher the similarity between the data point and the centroid.
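The ideas above can be illustrated with a minimal sketch on made-up numbers. The vector `consumption`, the seed, and the choice of two clusters are assumptions for this example and do not come from the dataset.

```r
# Made-up hourly consumption values forming two obvious groups
consumption <- c(0.10, 0.12, 0.11, 0.09, 2.00, 2.10, 1.90, 2.05)

# k-means starts from random centers, so fix the seed for repeatable output
set.seed(42)
fit <- kmeans(consumption, centers = 2, nstart = 10)

# Each center is the mean of the points assigned to that cluster
sort(as.vector(fit$centers))

# tot.withinss is the summed squared Euclidean distance to the assigned centers
fit$tot.withinss
```

The two centers are the means of the low group (0.105) and the high group (2.0125). A smaller `tot.withinss` indicates that points sit closer to their centroids.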

Having such clusters identified can help analysts distinguish patterns among similar values of data that were not identified before based on their similarities and differences.

For example, by clustering energy consumption data with the k-means method, one can identify a group of consumers that uses certain appliances at certain hours of the day, while another cluster may capture consumers who use the same appliances at a different hour.

As a result, utility companies can understand when and where energy demand is highest or lowest so that they can better manage their resources, reduce peak loads, and improve overall grid efficiency. Initiatives such as time-based pricing can then be implemented.

To visualize these results, a k-means cluster graph will be produced for each appliance as well as for the total consumption. These graphs show the clusters of users for each appliance by hour of the day.

          Further Documentation: https://www.rdocumentation.org/packages/grDevices/versions/3.6.2/topics/chull

          Further Documentation: https://ggplot2.tidyverse.org/reference/aes.html

          Further Documentation: https://ggplot2.tidyverse.org/
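The cluster graphs outline each cluster with a convex hull computed by `chull()`. As a small sketch of what that function returns, the example below uses five made-up points; the data frame `pts` is hypothetical.

```r
# Four corner points plus one interior point (coordinates are made up)
pts <- data.frame(
  Hour = c(1, 1, 5, 5, 3),
  kWh  = c(0, 2, 0, 2, 1)
)

# chull() returns the row indices of the points lying on the convex hull
hull_rows <- chull(pts$Hour, pts$kWh)

# Keeping only those rows gives the polygon outline; the interior point (row 5) is dropped
hull_pts <- pts[hull_rows, ]
print(hull_pts)
```

In the plots, the same idea is applied per cluster, so each shaded polygon is the smallest convex shape containing all points of that cluster.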

Total Consumption:

# kmeans() starts from random centers; call set.seed() first for reproducible clusters
k <- 5

kmeans_result_Total_Consumption <- kmeans(Averages_df$Total_Consumption, centers = k)

# One cluster label per hourly observation
length(kmeans_result_Total_Consumption$cluster)

Averages_df$Cluster <- as.factor(kmeans_result_Total_Consumption$cluster)

cluster_df_Total_Consumption <- data.frame(Hour = Averages_df$Hours,
  Total_Consumption = Averages_df$Total_Consumption,
  Cluster = Averages_df$Cluster)


centroids <- cluster_df_Total_Consumption %>%
  group_by(Cluster) %>%
  summarise(Hour = mean(Hour), Total_Consumption = mean(Total_Consumption))


Total_Comsuption_convex_hulls <- cluster_df_Total_Consumption %>%
  group_by(Cluster) %>%
  slice(chull(Hour, Total_Consumption))

ggplot(cluster_df_Total_Consumption, aes(x = Hour, y = Total_Consumption, color = Cluster)) +
  geom_point(alpha = 0.7) +  
  geom_point(data = centroids, aes(x = Hour, y = Total_Consumption), 
             color = "black", size = 4, shape = 3) +  
  geom_polygon(data = Total_Comsuption_convex_hulls, aes(fill = Cluster), alpha = 0.2, show.legend = TRUE) +  
  labs(title = "Cluster Visualization of Total Consumption by kWh",
       x = "Hour",
       y = "Total Consumption by kWh",
       color = "Cluster") +
  theme_minimal() +  
  scale_color_brewer(palette = "Spectral")

ggsave("Cluster_Total_Consumption.png", width = 6, height = 4)

In the Total Consumption K-Means cluster graph, the evening had the highest energy consumption, with users consuming between 1.5 and 2.5 kWh from 18:00 to 21:00.

This peak is most likely due to household members arriving home from work and beginning to use their appliances after being away during the day. The mean point of this cluster appears to be around 2.0 kWh at around 19:30, which is consistent with this explanation.

Dishwasher:

kmeans_result_Dish_Washer <- kmeans(Averages_df$Dish_Washer, centers = k)


Averages_df$Cluster <- as.factor(kmeans_result_Dish_Washer$cluster)


cluster_df_Dish_Washer <- data.frame(Hour = Averages_df$Hours,
  Dish_Washer = Averages_df$Dish_Washer,
  Cluster = Averages_df$Cluster)


centroids_Dish_Washer <- cluster_df_Dish_Washer %>%
  group_by(Cluster) %>%
  summarise(Hour = mean(Hour), Dish_Washer = mean(Dish_Washer))


Dishwasher_convex_hulls <- cluster_df_Dish_Washer  %>%
  group_by(Cluster) %>%
  slice(chull(Hour, Dish_Washer))

ggplot(cluster_df_Dish_Washer, aes(x = Hour, y = Dish_Washer, color = Cluster)) +
  geom_point(alpha = 0.7) +   
  geom_point(data = centroids_Dish_Washer, aes(x = Hour, y = Dish_Washer), 
             color = "black", size = 4, shape = 3) +   
  geom_polygon(data = Dishwasher_convex_hulls, aes(fill = Cluster), alpha = 0.2, show.legend = TRUE) +  
  labs(title = "Cluster Visualization of Dish Washer Consumption by kWh",
       x = "Hour",
       y = "Dishwasher Consumption by kWh",
       color = "Cluster") +
  theme_minimal() +  
  scale_color_brewer(palette = "Spectral")

ggsave("Cluster_Dish_Washer.png", width = 6, height = 4)

In the Dishwasher K-Means cluster graph, the evening also had the highest energy consumption, with users consuming between 0.15 and 0.3 kWh from 18:00 to 22:00. This trend is primarily driven by households running their dishwashers after dinner, as dishes accumulate throughout the day. The mean point from the cluster group with the highest usage appears to be around 0.2 kWh at around 20:00, suggesting that most dish washing activity occurs shortly after dinner before tapering off toward late evening.

Washing Machine:

kmeans_result_Washing_Machine <- kmeans(Averages_df$Washing_Machine, centers = k)

Averages_df$Cluster <- as.factor(kmeans_result_Washing_Machine$cluster)

cluster_df_Washing_Machine <- data.frame(Hour = Averages_df$Hours,
                                         Washing_Machine = Averages_df$Washing_Machine,
                                         Cluster = Averages_df$Cluster)


centroids_Washing_Machine <- cluster_df_Washing_Machine %>%
  group_by(Cluster) %>%
  summarise(Hour = mean(Hour), Washing_Machine = mean(Washing_Machine))

Washing_Machine_convex_hulls <- cluster_df_Washing_Machine  %>%
  group_by(Cluster) %>%
  slice(chull(Hour, Washing_Machine))




ggplot(cluster_df_Washing_Machine, aes(x = Hour, y = Washing_Machine, color = Cluster)) +
  geom_point(alpha = 0.7) +  
  geom_point(data = centroids_Washing_Machine, aes(x = Hour, y = Washing_Machine), 
             color = "black", size = 4, shape = 3) +  
  geom_polygon(data = Washing_Machine_convex_hulls, aes(fill = Cluster), alpha = 0.2, show.legend = TRUE) +
  labs(title = "Cluster visualization of Washing Machine Consumption by kWh",
       x = "Hour",
       y = "Washing Machine Consumption by kWh",
       color = "Cluster") +
  theme_minimal() +  
  scale_color_brewer(palette = "Spectral")

ggsave("Cluster_Washing_Machine.png", width = 6, height = 4)

In the Washing Machine K-Means cluster graph, the highest-consumption cluster spans most of the day, with users consuming between 0.2 and 0.3 kWh from 09:00 to 23:00. This pattern suggests that many households do their laundry either in the morning before leaving for work or in the evening after returning home, as these times are convenient for washing clothes. The mean point from the cluster group with the highest usage appears to be around 0.2 kWh at around 17:30, indicating that washing machine usage is concentrated around the time people settle in for the evening.

However, the distribution is bimodal, as this particular cluster had its highest energy usage in the morning at 09:00 and in the evening at 23:00.

Water Heater:

kmeans_result_Water_Heater <- kmeans(Averages_df$Water_Heater, centers = k)

Averages_df$Cluster <- as.factor(kmeans_result_Water_Heater$cluster)

cluster_df_Water_Heater <- data.frame(Hour = Averages_df$Hours,
  Water_Heater = Averages_df$Water_Heater,
  Cluster = Averages_df$Cluster)


centroids_Water_Heater <- cluster_df_Water_Heater %>%
  group_by(Cluster) %>%
  summarise(Hour = mean(Hour), Water_Heater = mean(Water_Heater))

Water_Heater_convex_hulls <- cluster_df_Water_Heater  %>%
  group_by(Cluster) %>%
  slice(chull(Hour, Water_Heater))

ggplot(cluster_df_Water_Heater, aes(x = Hour, y = Water_Heater, color = Cluster)) +
  geom_point(alpha = 0.7) +  
  geom_point(data = centroids_Water_Heater, aes(x = Hour, y = Water_Heater), 
             color = "black", size = 4, shape = 3) +
  geom_polygon(data = Water_Heater_convex_hulls, aes(fill = Cluster), alpha = 0.2, show.legend = TRUE) +  
  labs(title = "Cluster visualization of Water Heater Consumption by kWh",
       x = "Hour",
       y = "Water Heater Consumption by kWh",
       color = "Cluster") +
  theme_minimal() +  
  scale_color_brewer(palette = "Spectral")

ggsave("Cluster_Water_Heater.png", width = 6, height = 4)

In the Water Heater K-Means cluster graph, the highest energy consumption occurred during two main periods, with users consuming between 0.1 and 0.25 kWh from 00:00 to 05:00 and again from 18:00 to 22:00. These peaks align with early-morning and evening bathing routines, as well as meal preparation times when hot water is frequently used. The mean point of the highest-usage cluster is approximately 0.15 kWh at around 12:30. This midday position reflects the bimodal spread rather than a midday peak, since the mean point averages the hours of the early-morning and evening observations.
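Because the clusters are formed on consumption alone and the hour coordinate of each mean point is simply the average hour of its members, a cluster with two separate peaks can have a mean point where little usage occurs. The sketch below shows the arithmetic with made-up hours; `cluster_hours` is hypothetical.

```r
# Hypothetical hours at which one high-usage cluster's points occur:
# an early-morning group and an evening group, with nothing around midday
cluster_hours <- c(1, 2, 3, 4, 19, 20, 21, 22)

# The plotted mean point uses the average hour of all cluster members
mean_hour <- mean(cluster_hours)
print(mean_hour)

# Check whether any member of the cluster lies within two hours of that average
any(abs(cluster_hours - mean_hour) <= 2)
```

The average hour is 11.5 even though no observation in this toy cluster falls near midday, which is why a mean point near 12:30 should be read together with the spread of the cluster.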

TV:

kmeans_result_TV <- kmeans(Averages_df$TV, centers = k)

Averages_df$Cluster <- as.factor(kmeans_result_TV$cluster)

cluster_df_TV <- data.frame(Hour = Averages_df$Hours,
  TV = Averages_df$TV,
  Cluster = Averages_df$Cluster)

centroids_TV <- cluster_df_TV %>%
  group_by(Cluster) %>%
  summarise(Hour = mean(Hour), TV = mean(TV))

TV_convex_hulls <- cluster_df_TV  %>%
  group_by(Cluster) %>%
  slice(chull(Hour, TV))

ggplot(cluster_df_TV, aes(x = Hour, y = TV, color = Cluster)) +
  geom_point(alpha = 0.7) +  
  geom_point(data = centroids_TV, aes(x = Hour, y = TV), 
             color = "black", size = 4, shape = 3) +
  geom_polygon(data = TV_convex_hulls, aes(fill = Cluster), alpha = 0.2, show.legend = TRUE) + 
  labs(title = "Cluster visualization of TV Consumption by kWh",
       x = "Hour",
       y = "TV Consumption by kWh",
       color = "Cluster") +
  theme_minimal() +  
  scale_color_brewer(palette = "Spectral")

ggsave("Cluster_TV.png", width = 6, height = 4)

In the TV K-Means cluster graph, the highest energy consumption occurred in the evening, with users consuming between 0.04 and 0.06 kWh from 16:00 to 20:00. This trend aligns with typical entertainment habits, where households unwind after work by watching television as soon as they arrive home. The mean point from the cluster group with the highest usage is around 0.02 kWh at around 16:00, further supporting the statement from the previous sentence.

Kettle:

kmeans_result_Kettle <- kmeans(Averages_df$Kettle, centers = k)

Averages_df$Cluster <- as.factor(kmeans_result_Kettle$cluster)

cluster_df_Kettle <- data.frame(Hour = Averages_df$Hours,
  Kettle = Averages_df$Kettle,
  Cluster = Averages_df$Cluster)

centroids_Kettle <- cluster_df_Kettle %>%
  group_by(Cluster) %>%
  summarise(Hour = mean(Hour), Kettle = mean(Kettle))

Kettle_convex_hulls <- cluster_df_Kettle  %>%
  group_by(Cluster) %>%
  slice(chull(Hour, Kettle))

ggplot(cluster_df_Kettle, aes(x = Hour, y = Kettle, color = Cluster)) +
  geom_point(alpha = 0.7) +  
  geom_point(data = centroids_Kettle, aes(x = Hour, y = Kettle), 
             color = "black", size = 4, shape = 3) +
  geom_polygon(data = Kettle_convex_hulls, aes(fill = Cluster), alpha = 0.2, show.legend = TRUE) + 
  labs(title = "Cluster visualization of Kettle by Hour",
       x = "Hour",
       y = "Kettle",
       color = "Cluster") +
  theme_minimal() +  
  scale_color_brewer(palette = "Spectral")

ggsave("Cluster_Kettle.png", width = 6, height = 4)

In the Kettle K-Means cluster graph, the highest energy consumption occurred during two distinct peaks, with users consuming between 0.1 and 0.2 kWh from 06:00 to 09:00 and again from 16:00 to 19:00. These peaks coincide with common tea and coffee preparation times, as households use kettles before work and again to unwind after work. The mean point from the cluster group with the highest usage is around 0.1 kWh at 12:30. However, this is due to the spread having a bimodal distribution.

Microwave:

kmeans_result_Microwave <- kmeans(Averages_df$Microwave, centers = k)

Averages_df$Cluster <- as.factor(kmeans_result_Microwave$cluster)

cluster_df_Microwave <- data.frame(Hour = Averages_df$Hours,
   Microwave = Averages_df$Microwave,
   Cluster = Averages_df$Cluster)

centroids_Microwave <- cluster_df_Microwave %>%
  group_by(Cluster) %>%
  summarise(Hour = mean(Hour), Microwave = mean(Microwave))

Microwave_convex_hulls <- cluster_df_Microwave  %>%
  group_by(Cluster) %>%
  slice(chull(Hour, Microwave))

ggplot(cluster_df_Microwave, aes(x = Hour, y = Microwave, color = Cluster)) +
  geom_point(alpha = 0.7) +  
  geom_point(data = centroids_Microwave, aes(x = Hour, y = Microwave), 
             color = "black", size = 4, shape = 3) +  
  geom_polygon(data = Microwave_convex_hulls, aes(fill = Cluster), alpha = 0.2, show.legend = TRUE) +
  labs(title = "Cluster visualization of Microwave Consumption by kWh",
       x = "Hour",
       y = "Microwave Consumption by kWh",
       color = "Cluster") +
  theme_minimal() +  
  scale_color_brewer(palette = "Spectral")

ggsave("Cluster_Microwave.png", width = 6, height = 4)

In the Microwave K-Means cluster graph, the highest energy consumption occurred during three distinct peaks, with users consuming between 0.1 and 0.2 kWh from 07:00 to 09:00 for breakfast, 12:00 to 14:00 for lunch, and 18:00 to 20:00 for dinner. These peaks align with common meal preparation times, as households primarily use microwaves for reheating food or quick cooking. The mean point from the cluster group with the highest usage is around 0.2 kWh at 13:00. However, this is due to the spread having a trimodal distribution.

Lighting:

kmeans_result_Lighting <- kmeans(Averages_df$Lighting, centers = k)

Averages_df$Cluster <- as.factor(kmeans_result_Lighting$cluster)

cluster_df_Lighting <- data.frame(Hour = Averages_df$Hours,
  Lighting = Averages_df$Lighting,
  Cluster = Averages_df$Cluster)

centroids_Lighting <- cluster_df_Lighting %>%
  group_by(Cluster) %>%
  summarise(Hour = mean(Hour), Lighting = mean(Lighting))

Lighting_convex_hulls <- cluster_df_Lighting  %>%
  group_by(Cluster) %>%
  slice(chull(Hour, Lighting))

ggplot(cluster_df_Lighting, aes(x = Hour, y = Lighting, color = Cluster)) +
  geom_point(alpha = 0.7) +  
  geom_point(data = centroids_Lighting, aes(x = Hour, y = Lighting), 
             color = "black", size = 4, shape = 3) +  
  geom_polygon(data = Lighting_convex_hulls, aes(fill = Cluster), alpha = 0.2, show.legend = TRUE) +
  labs(title = "Cluster visualization of Lighting Consumption by kWh",
       x = "Hour",
       y = "Lighting Consumption by kWh",
       color = "Cluster") +
  theme_minimal() + 
  scale_color_brewer(palette = "Spectral")

ggsave("Cluster_Lighting.png", width = 6, height = 4)

In the Lighting K-Means cluster graph, the highest energy consumption occurred in the evening, with users consuming between 0.01 and 0.15 kWh from 17:00 to 23:00. This trend is expected, as lighting demand increases after sunset when households become more active indoors. The mean point from the cluster group with the highest usage is around 0.2 kWh at around 20:00, which is consistent with this explanation.

Dryer:

kmeans_result_Dryer <- kmeans(Averages_df$Dryer, centers = k)

length(kmeans_result_Dryer$cluster)

Averages_df$Cluster <- as.factor(kmeans_result_Dryer$cluster)  

cluster_df_Dryer <- data.frame(Hour = Averages_df$Hours,
  Dryer = Averages_df$Dryer,
  Cluster = Averages_df$Cluster)

centroids_Dryer <- cluster_df_Dryer %>%
  group_by(Cluster) %>%
  summarise(Hour = mean(Hour), Dryer = mean(Dryer))

Dryer_convex_hulls <- cluster_df_Dryer %>%
  group_by(Cluster) %>%
  slice(chull(Hour, Dryer))

ggplot(cluster_df_Dryer, aes(x = Hour, y = Dryer, color = Cluster)) +
  geom_point(alpha = 0.7) +  
  geom_point(data = centroids_Dryer, aes(x = Hour, y = Dryer), 
             color = "black", size = 4, shape = 3) + 
  geom_polygon(data = Dryer_convex_hulls, aes(fill = Cluster), alpha = 0.2, show.legend = TRUE) +  
  labs(title = "Cluster visualization of Dryer Consumption by kWh",
       x = "Hour",
       y = "Dryer Consumption by kWh",
       color = "Cluster") +
  theme_minimal() + 
  scale_color_brewer(palette = "Spectral")

ggsave("Cluster_Dryer.png", width = 6, height = 4)

In the Dryer K-Means cluster graph, the highest energy consumption occurred during the day, with users consuming between 0.1 and 0.25 kWh from 08:00 to 17:00. The mean point of this cluster is around 0.15 kWh at around 12:30. This is most likely due to users washing their clothes in the morning and then moving them to the dryer.

Refrigerator:

kmeans_result_Refrigerator <- kmeans(Averages_df$Refrigerator, centers = k)

Averages_df$Cluster <- as.factor(kmeans_result_Refrigerator$cluster)

cluster_df_Refrigerator <- data.frame(Hour = Averages_df$Hours,
  Refrigerator = Averages_df$Refrigerator,
  Cluster = Averages_df$Cluster)

centroids_Refrigerator <- cluster_df_Refrigerator %>%
  group_by(Cluster) %>%
  summarise(Hour = mean(Hour), Refrigerator = mean(Refrigerator))

Refrigerator_convex_hulls <- cluster_df_Refrigerator %>%
  group_by(Cluster) %>%
  slice(chull(Hour, Refrigerator))

ggplot(cluster_df_Refrigerator, aes(x = Hour, y = Refrigerator, color = Cluster)) +
  geom_point(alpha = 0.7) +  
  geom_point(data = centroids_Refrigerator, aes(x = Hour, y = Refrigerator), 
             color = "black", size = 4, shape = 3) + 
  geom_polygon(data = Refrigerator_convex_hulls, aes(fill = Cluster), alpha = 0.2, show.legend = TRUE) +  
  labs(title = "Cluster visualization of Refrigerator Consumption by kWh",
       x = "Hour",
       y = "Refrigerator Consumption by kWh",
       color = "Cluster") +
  theme_minimal() + 
  scale_color_brewer(palette = "Spectral")

ggsave("Cluster_Refrigerator.png", width = 6, height = 4)

In the Refrigerator K-Means cluster graph, the highest energy consumption occurred in the morning between 02:00 and 07:00, and in the evening between 17:00 and 21:00, with users consuming between 0.2 and 0.4 kWh per hour. While refrigerators operate continuously, slight increases were observed around meal times due to frequent door openings and the temperature regulation that follows. The mean point from the cluster group with the highest usage is around 0.05 kWh per hour at around 13:00. However, this is due to the spread having a trimodal distribution, as households use the refrigerator more frequently during morning, lunch, and evening hours than at other hours of the day.

Dimensionality reduction is a technique used in data analysis in which the number of variables in a dataset is reduced while retaining as much information as possible.

In essence, dimensionality reduction techniques condense the data into fewer dimensions, making it easier to interpret and work with.

To apply dimensionality reduction to the dataset, a principal component analysis (PCA) will be conducted and visualized with a cumulative variance graph, a PCA scatter plot, and a biplot.
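As a minimal sketch of what PCA produces, the example below runs `prcomp()` on `USArrests`, a dataset that ships with R. It is used here only as a stand-in for the appliance data, and the object names are illustrative.

```r
# USArrests ships with R; it is used here only as a stand-in numeric dataset
pca_toy <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Variance explained by each component is its squared standard deviation
variance_share <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)
cumulative_share <- cumsum(variance_share)
print(round(cumulative_share, 3))

# Scores: each observation expressed in the new component coordinates
head(pca_toy$x[, 1:2])
```

The components come out ordered from most to least variance explained, so the cumulative share always rises toward 1 as components are added.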

          Further Documentation: https://www.rdocumentation.org/packages/memisc/versions/0.99.31.8/topics/Sapply

          Further Documentation: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/scale

          Further Documentation: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/prcomp





ClusterAverage = Averages_df$Cluster

Average_df_notime_Clusters = data.frame(
  Washing_Machine = averagesobservations1Washing_Machine,
  Dish_Washer = averagesobservations1DishWasher,
  Dryer = averagesobservations1Dryer,
  Refrigerator = averageobservations1Refrigerator,
  Lighting = averageobservations1Lighting,
  TV = averageobservations1TV,
  Microwave = averageobservations1Microwave,
  Water_Heater = averageobservations1WaterHeater,
  Kettle = averageobservations1Kettle,
  Cluster = ClusterAverage
)

Average_df_notime = data.frame(
  Washing_Machine = averagesobservations1Washing_Machine,
  Dish_Washer = averagesobservations1DishWasher,
  Dryer = averagesobservations1Dryer,
  Refrigerator = averageobservations1Refrigerator,
  Lighting = averageobservations1Lighting,
  TV = averageobservations1TV,
  Microwave = averageobservations1Microwave,
  Water_Heater = averageobservations1WaterHeater,
  Kettle = averageobservations1Kettle)









          Further Documentation: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/cumsum

          Further Documentation: https://stackoverflow.com/questions/23866765/getting-cumulative-proportion-in-pca

Cumulative variance indicates the total variance explained by the selected principal components.

This measure helps determine how many components are necessary to adequately capture the data’s variability.

Within the graph, the principal components are shown along the horizontal axis, while the cumulative variance is displayed along the vertical axis. The higher the cumulative variance, the more of the variation in appliance energy use across households the selected components capture. The 5 percent significance line highlights components that provide meaningful contributions to the analysis.

pca_Energy_Dataset <- Average_df_notime[, sapply(Average_df_notime, is.numeric)]  

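# Standardize each appliance column (mean 0, standard deviation 1)
pca_Energy_Dataset_scaled <- scale(pca_Energy_Dataset)


# center and scale. are redundant on already standardized data, but harmless
pca_result <- prcomp(pca_Energy_Dataset_scaled, center = TRUE, scale. = TRUE)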


summary(pca_result)

Importance of components:
                         PC1    PC2    PC3    PC4     PC5     PC6     PC7     PC8     PC9
Standard deviation     1.818 1.2652 1.1527 0.9594 0.79764 0.69335 0.56103 0.47694 0.43287
Proportion of Variance 0.367 0.1779 0.1476 0.1023 0.07069 0.05341 0.03497 0.02527 0.02082
Cumulative Proportion  0.367 0.5449 0.6926 0.7948 0.86552 0.91893 0.95391 0.97918 1.00000

PC1 is the most important principal component, as it has the highest standard deviation (1.818) and explains the largest proportion of variance (36.7%). PC2 is the second most important component, explaining an additional 17.79% of the variance. For further analysis, at least 80% of the variance should be retained. Keeping the components up to PC4 brings the cumulative proportion to 79.48%, just under that threshold, while keeping the components up to PC5 captures 86.55% of the variance. Retaining five components therefore preserves most of the variability in the data for additional analysis.
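The number of components needed to reach a retention target can be read off programmatically. The sketch below copies the standard deviations from the summary output above; the 80% `threshold` is the target discussed in the text, and the variable names are illustrative.

```r
# Standard deviations copied from the summary(pca_result) output above
sdev <- c(1.818, 1.2652, 1.1527, 0.9594, 0.79764,
          0.69335, 0.56103, 0.47694, 0.43287)

# Cumulative proportion of variance explained
cumulative <- cumsum(sdev^2) / sum(sdev^2)

# Retention target of 80%
threshold <- 0.80

# Index of the first component at which the target is met
n_needed <- which(cumulative >= threshold)[1]
print(n_needed)
```

With these values, four components fall just short of the target and the fifth is the first to exceed it.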

print(pca_result)

Standard deviations (1, .., p=9):
[1] 1.8175411 1.2652387 1.1527078 0.9593765 0.7976445 0.6933492 0.5610328 0.4769378 0.4328738

Rotation (n x k) = (9 x 9):
                   PC1         PC2        PC3         PC4         PC5         PC6           PC7
Washing_Machine 0.45044039  0.05908587  0.2450831  0.10001502 -0.20155009 -0.34984296  0.5211225677
Dish_Washer     0.47621094 -0.02529252  0.1726730  0.02632973 -0.27715433 -0.23243115  0.0344192269
Dryer           0.17684138  0.26981493  0.6540752  0.30819287 -0.01085174  0.26450325 -0.5125962239
Refrigerator    0.07102452 -0.58234552  0.2829506 -0.03493650  0.63142930 -0.38785560 -0.1303114802
Lighting        0.35684307 -0.33892561 -0.4123166  0.04876523 -0.31240255  0.01090998 -0.3291835259
TV              0.33026365 -0.24766530 -0.2931105  0.56373974  0.18971896  0.35568815 -0.0005888541
Microwave       0.34397721  0.41108777 -0.1529433 -0.04581222  0.56341951  0.25409903  0.3099839215
Water_Heater    0.25329778 -0.32690087  0.2294286 -0.64310915 -0.07979140  0.56556606  0.0946476633
Kettle          0.33929057  0.36314839 -0.2651320 -0.39652447  0.16231286 -0.30352488 -0.4838390884
                    PC8         PC9
Washing_Machine  0.51967597 -0.13747437
Dish_Washer     -0.70711705  0.33171708
Dryer            0.08847058 -0.18018929
Refrigerator    -0.03051478 -0.08660503
Lighting        -0.03212484 -0.61506197
TV               0.20921615  0.46860694
Microwave       -0.28070779 -0.36097280
Water_Heater     0.11875308  0.11600457
Kettle           0.28889722  0.29907164

The standard deviations indicate that PC1 has the highest spread (1.8175), meaning it captures the most variance, while PC9 has the lowest (0.4329), contributing the least.
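The link between these standard deviations and the proportions of variance reported earlier can be checked by hand. The following is a minimal sketch that copies the nine printed standard deviations into a vector and assumes an 80% retention threshold; the names `sdev`, `prop_var`, `cum_var`, and `n_keep` are illustrative and not part of the original script.

```r
# Standard deviations copied from the print(pca_result) output above
sdev <- c(1.8175411, 1.2652387, 1.1527078, 0.9593765, 0.7976445,
          0.6933492, 0.5610328, 0.4769378, 0.4328738)

# Each component's variance is its squared standard deviation
prop_var <- sdev^2 / sum(sdev^2)
cum_var  <- cumsum(prop_var)

# Smallest number of components whose cumulative share reaches 80%
n_keep <- which(cum_var >= 0.80)[1]

print(round(prop_var, 4))
print(round(cum_var, 4))
print(n_keep)
```

With these values the threshold is first reached at the fifth component, which agrees with the cumulative proportions in the summary table.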

The rotation matrix displays the loadings, which show how strongly each appliance influences each principal component. PC1, the most important component, is primarily influenced by the Dish Washer (0.4762), Washing Machine (0.4504), and Lighting (0.3568), with the Microwave (0.3440), Kettle (0.3393), and TV (0.3303) close behind. Because all nine loadings on PC1 are positive, PC1 appears to capture overall household energy consumption across appliances.

PC2, the second most important component, is largely affected by the Refrigerator (-0.5823) and Microwave (0.4111). The opposite signs suggest that the refrigerator moves inversely to PC2 while the microwave moves with it. This is likely because the refrigerator maintains near-constant energy usage, while microwaves are used in short bursts.
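Reading the largest loadings off the rotation matrix by eye is error-prone, so the ranking can be reproduced directly. This is a small sketch that re-enters the PC1 and PC2 columns printed above; the object names `loadings`, `rank_pc1`, and `rank_pc2` are assumptions made for illustration.

```r
# PC1 and PC2 loadings copied from the rotation matrix above
appliances <- c("Washing_Machine", "Dish_Washer", "Dryer", "Refrigerator",
                "Lighting", "TV", "Microwave", "Water_Heater", "Kettle")
pc1 <- c(0.45044039, 0.47621094, 0.17684138, 0.07102452, 0.35684307,
         0.33026365, 0.34397721, 0.25329778, 0.33929057)
pc2 <- c(0.05908587, -0.02529252, 0.26981493, -0.58234552, -0.33892561,
         -0.24766530, 0.41108777, -0.32690087, 0.36314839)
loadings <- data.frame(PC1 = pc1, PC2 = pc2, row.names = appliances)

# Order appliances by the absolute size of their loading on each component
rank_pc1 <- loadings[order(abs(loadings$PC1), decreasing = TRUE), "PC1", drop = FALSE]
rank_pc2 <- loadings[order(abs(loadings$PC2), decreasing = TRUE), "PC2", drop = FALSE]

print(rank_pc1)
print(rank_pc2)
```

Ranking by absolute value matters because the sign of a loading only gives the direction of the relationship, not its strength.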

num_components <- 2


reduced_data <- pca_result$x[, 1:num_components]

reduced_data_df <- as.data.frame(reduced_data)




cumulative_variance <- cumsum(pca_result$sdev^2) / sum(pca_result$sdev^2)
cumulative_variance_df <- data.frame(
  Components = seq_along(cumulative_variance),
  CumulativeVariance = cumulative_variance
)


  ggplot(cumulative_variance_df, aes(x = Components, y = CumulativeVariance)) +
  geom_line() +
  geom_point() +
  labs(x = "Number of Components", 
       y = "Cumulative Variance", 
       title = "Cumulative Variance Explained by PCA Components") +
  geom_hline(yintercept = 0.95, linetype = "dashed", color = "red") +
  theme_minimal()

ggsave("Newcumvar.png", width = 6, height = 4)

In the Cumulative Variance PCA graph, the plot illustrates how much of the total variance in energy consumption is explained as principal components are added. The first five components account for more than 80% of the variance (86.55%), indicating that most of the variation in household energy usage is driven by a few key factors, such as the time of day and its patterns in relation to appliance usage.

PCA, or Principal Component Analysis, is a statistical technique that summarizes a set of correlated variables with a smaller number of new variables that retain as much of the variance in the observations as possible.

These new variables are the principal components. Each one is a weighted combination of the original variables, and the components are uncorrelated with one another, so they show the patterns in the data without overlapping information.

When the first two principal components (PC1 and PC2) are plotted against each other, one can observe how the different cluster groups relate to one another.

The position of each dot shows the scores of that particular observation on the two principal components.

Observations that are closer together are more similar to each other based on the appliance measurements that entered the PCA.

The more tightly the observations within a group sit together, the more clearly the PCA separates that pattern from the rest of the data.

Such relationships support arguments for energy management strategies amongst appliances within households, especially when the usage of some of these appliances depends on the usage of others.
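The two properties described above, that each component is a weighted combination of the original variables and that the components are uncorrelated, can be verified on any numeric dataset. The sketch below uses the `USArrests` data that ships with base R as a stand-in for the appliance data; `pca_demo`, `scores`, and `manual_pc1` are illustrative names, not objects from the analysis.

```r
# USArrests ships with base R; it stands in for the appliance data here
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Scores: each observation expressed in principal-component coordinates
scores <- pca_demo$x

# Correlations between different components are zero up to rounding error
print(round(cor(scores), 10))

# PC1 rebuilt by hand as a weighted sum of the standardized variables
manual_pc1 <- scale(USArrests) %*% pca_demo$rotation[, 1]
print(all.equal(as.numeric(manual_pc1), as.numeric(scores[, 1])))
```

The same checks could be run on `pca_result`, provided it was created with `prcomp()` on centered and scaled data.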

loadings_df <- as.data.frame(pca_result$rotation)
loadings_df$Variable <- rownames(loadings_df) 

# factor() keeps the cluster labels discrete so that scale_color_brewer() can be applied
ggplot(reduced_data_df, aes(x = PC1, y = PC2, color = factor(Average_df_notime_Clusters$Cluster))) +
  geom_point(size = 1, alpha = 0.7) +
  labs(title = "K-means Clustering on PCA Results",
       x = "PC1",
       y = "PC2",
       color = "Cluster") +
  theme_minimal() +
  theme(legend.position = "right") +
  scale_color_brewer(palette = "Spectral")

ggsave("PCA.png", width = 6, height = 4)

In the PCA scatter plot, the clusters are distributed unevenly. The dots are not spread uniformly across the graph, indicating that the cluster groups have distinct energy consumption behaviors.

To understand how strongly each appliance contributes to the principal components, a biplot can be used to visualize the loadings alongside the observations.

A biplot combines both the PCA graph and the variables in one graph, allowing for interpretation of how strongly each variable contributes to the principal components through the directions and lengths of the vectors.

Longer vectors indicate a stronger relationship between the variables represented by these vectors and the principal components derived from the dataset.

By interpreting the vectors, it becomes possible to understand which appliances contribute most strongly to the variation in energy consumption.

As a result, decisions could be made regarding energy management to optimize energy usage during certain hours of the day.
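The length and direction of a biplot vector follow directly from the loadings. As a rough illustration, the sketch below takes the PC1 and PC2 loadings of three appliances from the rotation matrix shown earlier and computes each arrow's length and angle; the data frame name `vec` and the choice of these three appliances are assumptions made for the example.

```r
# PC1 and PC2 loadings for three appliances, copied from the rotation matrix
vec <- data.frame(
  PC1 = c(0.07102452, 0.34397721, 0.17684138),
  PC2 = c(-0.58234552, 0.41108777, 0.26981493),
  row.names = c("Refrigerator", "Microwave", "Dryer")
)

# Euclidean length of each arrow in the PC1-PC2 plane
vec$Length <- sqrt(vec$PC1^2 + vec$PC2^2)

# Direction in degrees, measured counter-clockwise from the PC1 axis
vec$Angle <- atan2(vec$PC2, vec$PC1) * 180 / pi

print(round(vec, 3))
```

Arrows that point in similar directions belong to appliances whose loadings on PC1 and PC2 move together, while arrows at roughly right angles indicate little shared variation on these two components.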

ggplot(reduced_data_df, aes(x = PC1, y = PC2)) +
  geom_point(color = "blue") +
  # One arrow per appliance, drawn from the origin to its (PC1, PC2) loadings
  geom_segment(data = loadings_df, aes(xend = PC1, yend = PC2),
               x = 0, y = 0,
               arrow = arrow(type = "open", length = unit(0.1, "inches")),
               color = "red") +
  # geom_text_repel() comes from the ggrepel package
  geom_text_repel(data = loadings_df, aes(label = Variable),
                  color = "red", size = 5,
                  max.overlaps = Inf) +
  labs(title = "PCA Biplot", x = "PC1", y = "PC2") +
  theme_minimal()

ggsave("Biplot.png", width = 6, height = 4)

In the biplot, appliances such as kettles and microwaves exhibit high variability, reflecting their usage patterns since these appliances are only used for short periods of time, while refrigerators show less variation due to their constant operation in maintaining the temperature of household goods.

Association rules are a method used in data analysis to find interesting relationships between different variables in a dataset. They help identify patterns that show how items are related to each other based on their occurrences.

Association rules are essentially “If-Then” statements, indicating that the presence of certain items or conditions implies the presence of others.

The “if” condition is known as the antecedent, and the consequent is the condition that follows from the antecedent. The validity of such rules is measured using metrics such as support, confidence, and lift.

Support indicates how frequently the conditions appear together in the dataset, confidence measures the likelihood of the consequent given the antecedent, and lift compares that confidence with how often the consequent would be expected to occur on its own.
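These three metrics can be computed by hand on a very small example. The sketch below uses hypothetical toy data (eight hours, two appliances, TRUE when the appliance was in use) and evaluates the rule "if Kettle then Microwave"; none of the values come from the dataset analyzed here.

```r
# Hypothetical toy data: one row per hour, TRUE if the appliance was in use
tx <- data.frame(
  Kettle    = c(TRUE, TRUE, FALSE, TRUE,  FALSE, TRUE, FALSE, FALSE),
  Microwave = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE,  FALSE)
)

# Support: share of hours in which both appliances were in use
support <- mean(tx$Kettle & tx$Microwave)

# Confidence: share of Kettle hours in which the Microwave was also in use
confidence <- support / mean(tx$Kettle)

# Lift: confidence relative to the overall share of Microwave hours
lift <- confidence / mean(tx$Microwave)

print(c(support = support, confidence = confidence, lift = lift))
```

For this toy data the rule has a support of 0.375, a confidence of 0.75, and a lift of 1.5, meaning the microwave is in use 1.5 times as often during kettle hours as it is overall.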

For example, an association rule could state that if a household demands higher than average levels of energy from 18:00 to 19:00, then it is likely to also demand higher than average levels of energy from 19:00 to 20:00. This type of rule is known as a sequential association rule.

A sequential association rule identifies relationships between events or occurrences over a period of time. In essence, it captures the likelihood that the occurrence of a particular event will be followed by the occurrence of another event.
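A sequential rule of this kind can be evaluated by pairing each hour with the hour that follows it. The following sketch uses a hypothetical vector of ten hourly demand values (in kWh) and measures how often an above-average hour is followed by another above-average hour; the numbers and variable names are made up for illustration.

```r
# Hypothetical hourly demand in kWh for ten consecutive hours
demand <- c(0.4, 0.5, 1.2, 1.4, 1.3, 0.6, 0.3, 1.1, 1.2, 0.5)

# Flag hours with above-average demand
high <- demand > mean(demand)

# Pair every hour (antecedent) with the hour that follows it (consequent)
antecedent <- head(high, -1)
consequent <- tail(high, -1)

# Confidence of the rule "high demand now, then high demand in the next hour"
confidence <- sum(antecedent & consequent) / sum(antecedent)

print(confidence)
```

Here five of the nine hour pairs start with a high-demand hour and three of those are followed by another high-demand hour, giving a confidence of 0.6.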

In order to understand how these association rules apply to the appliances, the Apriori method will be used to analyze consumption patterns and relationships between different appliances.

In addition, a parallel coordinates plot, a scatter plot for association rules, and a matrix plot for association rules will be utilized to visualize the association rules for the appliances in this dataset.

          Further Documentation:  https://cran.r-project.org/web/packages/arules/arules.pdf  


rules <- apriori(Average_df_notime, parameter = list(supp = 0.2, conf = 0.6))

In the parallel coordinates plot, the values of support from the Apriori method indicate how frequently certain combinations of appliances are used together during certain hours of the day.

The longer the lines in the parallel coordinates plot, the stronger the association between the appliances and the more frequently those appliances are used together.

The Right-Hand Side (RHS) represents the outcome of a usage rule and the Left-Hand Side (LHS) indicates the conditions leading to that particular outcome.

By analyzing these relationships among the appliances, utility companies can identify peak usage patterns and focus their attention on optimizing energy consumption among households for such combinations.

          Further Documentation: https://www.rdocumentation.org/packages/arulesViz/versions/1.1-0/topics/plot
 

 plot(rules, method = "paracoord", control = list(reorder = TRUE, alpha = 0.7, col = rainbow(10)))  

There were rendering issues with this image of the plot while using ggplot. As such, a third party url was utilized.

  <img src="https://i.imghippo.com/files/flF7335Y.png" alt="" border="0">

Appliances such as microwaves and kettles had the longest lines in this parallel coordinates plot, indicating that they co-occur, that is, they are used together more frequently than other appliances.

This is most likely because both appliances are used during meal preparation.

summary(rules)
  
 
       
set of 940 rules

rule length distribution (lhs + rhs):sizes
1   2   3   4   5   6   7 
7  91 263 322 193  57   7 

 Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
1.000   3.000   4.000   3.853   5.000   7.000 

summary of quality measures:
    support         confidence        coverage           lift            count     
 Min.   :0.2000   Min.   :0.6141   Min.   :0.2000   Min.   :0.9211   Min.   :1757  
 1st Qu.:0.2196   1st Qu.:0.7987   1st Qu.:0.2462   1st Qu.:1.0000   1st Qu.:1929  
 Median :0.3333   Median :0.9177   Median :0.3333   Median :1.1488   Median :2928  
 Mean   :0.3529   Mean   :0.8900   Mean   :0.4033   Mean   :1.1688   Mean   :3100  
 3rd Qu.:0.4647   3rd Qu.:1.0000   3rd Qu.:0.5162   3rd Qu.:1.3071   3rd Qu.:4082  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :2.3248   Max.   :8784  

mining info:
                data ntransactions support confidence
Average_df_notime          8784     0.2        0.6
                                                                    call
apriori(data = Average_df_notime, parameter = list(supp = 0.2, conf = 0.6))

The quality measures of these rules provide insights into their significance.

Support, which represents the proportion of transactions where a rule applies, ranges from 0.2 to 1.0.

This means that even the least frequent rules apply to at least 20% of the observations, which follows from the minimum support of 0.2 set in the Apriori call. The median support value of 0.3333 indicates that half of the rules hold true in at least 33.33% of all observations from the dataset.

Confidence, which measures how often the rule is correct when its left-hand side (LHS) conditions are met, has a median value of 0.9177, meaning that for half of the rules the consequent occurs in at least 91.77% of the observations where the antecedent conditions are present.

Lift, which indicates the strength of the relationship between items, ranges from 0.9211 to 2.3248, with a median lift of 1.1488. A lift greater than 1 indicates a positive association, so at least half of the rules link items that occur together more often than would be expected if they were independent. The first quartile of 1.0000, however, shows that at least a quarter of the rules have a lift of 1 or below and therefore carry little information.
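The support and count columns of the summary are two views of the same quantity, since support is the count divided by the number of transactions. A quick check, using only the figures printed in the summary above (the variable names are illustrative):

```r
# Number of transactions and rule counts taken from the summary output
n_transactions <- 8784
counts <- c(min = 1757, median = 2928, max = 8784)

# Support is the share of transactions in which a rule holds
support <- counts / n_transactions

print(round(support, 4))
```

The results reproduce the minimum, median, and maximum support values reported in the summary.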

The scatter plot for Association Rules uses the support and confidence values derived from the Apriori method in order to interpret the relationships between appliances.

Each point on the scatter plot represents one rule, reflecting instances in which certain appliances are used together at certain times, especially when high confidence is indicated.

The support indicates the overall frequency of usage for the combination of these appliances during specific time frames, while confidence reflects the likelihood that both appliances are used together given that one is already in use.

Where the points form clusters in the scatter plot, the dots with high confidence indicate rules that hold reliably during certain hours of the day.

plot(rules, method = "scatter", measure = "support", shading = "confidence",
       main = "Scatter Plot of Association Rules")
   

ggsave("Scatterplot_Association.png", width = 6, height = 4)

In the scatter plot, one can observe clusters with high support and high confidence, which indicate strong and frequently occurring rules. These rules suggest co-occurrence amongst appliances. Meanwhile, rules with low support but high confidence indicate relationships that are strong but do not occur as frequently.

The matrix plot for Association Rules also displays quality measures derived from the Apriori method, here support and lift, to examine relationships between appliances.

However, in the matrix plot, each cell represents a rule, that is, instances where specific appliances are used simultaneously during certain times of the day.

The Left-Hand Side (LHS) of a rule consists of the appliances whose usage forms the condition of the rule. The Right-Hand Side (RHS) represents the appliance that is expected to be used when that condition is met.

To further clarify, a table of the association rules lists the antecedent of each rule, followed by its consequent and the corresponding lift value.

The lift value describes how strongly the appliances within a rule are associated. A lift value greater than 1 implies a positive association, indicating that the consequent appliances are more likely to be used when the antecedent appliances are used.

By knowing these rules, energy management could become more efficient in targeting inefficiencies among these appliances, since the rules indicate which appliances are associated with each other.

plot(rules, method = "matrix", measure = c("support", "lift"),
       main = "Matrix Plot of Association Rules")
       

ggsave("Matrix.png", width = 6, height = 4)

In the matrix plot, the lift values compare how often the appliances in a rule are used together with how often that would be expected if their usage were independent.

Lift values above 1 indicate a positive association, and in this plot there are several rules that have a lift value above 1.

This suggests that many of the appliances tend to be used simultaneously.