Dimension reduction is a machine learning process of converting data points from a high dimensional space to a reduced space with less number of features. The aim is to minimize the loss of information and preserve as much original variance as possible. This transformation is driven by the need for better generalization as many features in a data set are often related or redundant and it is computationally necessary to optimize processing time. Pricipal Component Analysis (PCA) serves as a fundamental linear approach, identifying new features in a transformed space that capture the maximum variation in the data. However, as PCA can be highly affected by outliers and non-linear relationships, Multidimensional Scaling was also employed. This method focuses on finding dissimilarities between pairs of objects and representing them as distances in a lower-dimension space.
The primary objective of this project is to perform dimensionality reduction on a 25-dimensional data set in order to discover hidden structures while preserving the most of original information. Given that many of the features in this data set are related, including all of them would increase complexity and it wouldn’t likely improve the quality of models. PCA and MDS are performed on the data, then both solutions are compared.
The data set used in this study was sourced form the UCI Machine Learining Repository (https://archive.ics.uci.edu/dataset/374/appliances+energy+prediction). The data was originally collected by Luis M. Candanedo form a house in Stambruges, Belgium, over a period of 4.5 months with a 10-minute sampling interval. It was used for the publication: Data driven prediction models of energy use of appliances in a low-energy house. Luis M. Candanedo, Véronique Feldheim, Dominique Deramaix. Energy and Buildings, Volume 140, 1 April 2017, Pages 81-97, ISSN 0378-7788, http://dx.doi.org/10.1016/j.enbuild.2017.01.083.
The primary objective of the dataset was to predict of appliances’ energy usage based on indoor and outdoor environmental conditions. The monitoring was based on utilizing ZigBee sensors to record temperature and humidity across nine different zones within the house. Additionally, the weather information from a nearest airport was included. The variable descriptions are presented below.
| Variable | Description | Units |
|---|---|---|
| date | Date and time | YYYY-MM-DD HH:MM:SS |
| Appliances | Energy use of appliances | Wh |
| lights | Energy use of light fixtures | Wh |
| T1 | Temperature in kitchen area | °C |
| RH_1 | Humidity in kitchen area | % |
| T2 | Temperature in living room area | °C |
| RH_2 | Humidity in living room area | % |
| T3 | Temperature in laundry room area | °C |
| RH_3 | Humidity in laundry room area | % |
| T4 | Temperature in office room | °C |
| RH_4 | Humidity in office room | % |
| T5 | Temperature in bathroom | °C |
| RH_5 | Humidity in bathroom | % |
| T6 | Temperature outside (north side) | °C |
| RH_6 | Humidity outside (north side) | % |
| T7 | Temperature in ironing room | °C |
| RH_7 | Humidity in ironing room | % |
| T8 | Temperature in teenager room 2 | °C |
| RH_8 | Humidity in teenager room 2 | % |
| T9 | Temperature in parents room | °C |
| RH_9 | Humidity in parents room | % |
| T_out | Temperature outside (weather station) | °C |
| Press_mm_hg | Pressure (weather station) | mm Hg |
| RH_out | Humidity outside (weather station) | % |
| Windspeed | Windspeed (weather station) | m/s |
| Visibility | Visibility (weather station) | km |
| Tdewpoint | Dew point temperature | °C |
| rv1 | Random variable 1 | None |
| rv2 | Random variable 2 | None |
After removing missing values and excluding the timestamp along with noise (rv1,rv2), the final dataset consisted of 19735 observations. Following the inspection of the variables’ distribution plots presented below, some data transformations were necessary. First, the Total_Energy variable was created, which is a sum of energy used by appliances and lights, to represent the total household energy demand. Then, the Total_energy variable was logarithmically transformed due to a significant right-skewed distribution. Lastly, the data were normalized using the Z-score standardization, ensuring a lack of bias caused by different units. The final processed dataset was a 25-dimensional feature space.
To verify whether the data is suitable for PCA dimensionality reduction, a correlation analysis was conducted to examine linear relationships between 25 variables. As presented below, significant positive correlations are observed for variables within the temperature sensors (T1-T9) and humidity sensors (RH). Additionally, temperature variables negatively correlate with RH_6 and RH_out (humidity outside the building). The presence of these strong correlations suggest a high level of information redundancy, which confirms the data set is suitable for PCA.
#Correlation plot
M <- cor(data_scaled)
corrplot(M, type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.5)
To further evaluate the data, the Kaiser-Meyer-Olkin (KMO) Measure of Sampling Adequacy and Bartlett’s Test of Sphericity were performed.
The overall KMO value reached 0.86, which exeeds the recommended threshold of 0.60. The MSA for nearly every variabled remained above 0.70, indicating that the variables share sufficient common variance. The Bartlett’t test was applied to verify that the correlation matrix differs significantly from an identity matrix. The Chi-squared value was very high with 300 degrees of freedom and a significance level of p<0.001. In result we can reject the null hypothesis and confirm that the variables are correlated which justifies the application of dimension reduction techniques.
#Kaiser-Meyer-Olkin
kmo_result <- KMO(data_scaled)
print(kmo_result)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = data_scaled)
## Overall MSA = 0.86
## MSA for each item =
## T1 RH_1 T2 RH_2
## 0.85 0.81 0.78 0.71
## T3 RH_3 T4 RH_4
## 0.95 0.95 0.95 0.92
## T5 RH_5 T6 RH_6
## 0.95 0.87 0.91 0.92
## T7 RH_7 T8 RH_8
## 0.90 0.89 0.92 0.89
## T9 RH_9 T_out Press_mm_hg
## 0.90 0.93 0.76 0.73
## RH_out Windspeed Visibility Tdewpoint
## 0.64 0.73 0.78 0.76
## log_Total_Energy
## 0.62
#Bartlett
n_obs <- nrow(data_scaled)
bartlett_result <- cortest.bartlett(M, n = n_obs)
print(bartlett_result)
## $chisq
## [1] 788935.2
##
## $p.value
## [1] 0
##
## $df
## [1] 300
Given the high dimensionality and the multicollinearity between environmental sensors, Principal Component Analysis (PCA) was employed. The Bartlett’s test and Kaiser-Meyer-Olkin measure further justify using this technique.
To determine the number of components to retain, the Kaiser’s Rule was applied, which suggests that only components with eigenvalues greater than 1.0 should remain. In this analysis, the first five principal components yielded eigenvalues that exceed the threshold.
The output of summary(pca) reveals that the first five components collectively explain 81.6% of the total variance. The level of around 80% variance preservation ensures that the vast majority of data characteristics are retained while the noise is filtered out. The scree plot below visually represents the drop in explained variance for each component. A distinct elbow is observed after the third or fourth component. However, we decide to retain five components, which allows to surpass the 80% mark of cumulative variance explained.
#PCA
pca <- prcomp(data_scaled, center = TRUE, scale = FALSE)
#Kaiser’s rule: eigenvalues>1 should stay; 5
eigen(M)$values
## [1] 9.343079883 7.057992868 1.866189216 1.126422649 1.006660908 0.969885515
## [7] 0.851166599 0.557660229 0.526691247 0.373129758 0.238291838 0.176238105
## [13] 0.144982828 0.140004270 0.117207886 0.114341807 0.093532273 0.073071734
## [19] 0.067187179 0.044887269 0.042379300 0.028547855 0.022545294 0.014204187
## [25] 0.003699303
fviz_eig(pca, choice = "eigenvalue", ncp = 22, barfill = "skyblue", barcolor = "skyblue3", linecolor = "skyblue4", addlabels = TRUE, main = "Eigenvalues")
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.0566 2.6567 1.36609 1.06133 1.00332 0.9848 0.92259
## Proportion of Variance 0.3737 0.2823 0.07465 0.04506 0.04027 0.0388 0.03405
## Cumulative Proportion 0.3737 0.6560 0.73069 0.77575 0.81601 0.8548 0.88886
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.74677 0.72573 0.61084 0.48815 0.41981 0.3808 0.3742
## Proportion of Variance 0.02231 0.02107 0.01493 0.00953 0.00705 0.0058 0.0056
## Cumulative Proportion 0.91116 0.93223 0.94716 0.95669 0.96374 0.9695 0.9751
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.34236 0.33814 0.30583 0.27032 0.25920 0.2119 0.2059
## Proportion of Variance 0.00469 0.00457 0.00374 0.00292 0.00269 0.0018 0.0017
## Cumulative Proportion 0.97982 0.98440 0.98814 0.99106 0.99375 0.9955 0.9972
## PC22 PC23 PC24 PC25
## Standard deviation 0.16896 0.1502 0.11918 0.06082
## Proportion of Variance 0.00114 0.0009 0.00057 0.00015
## Cumulative Proportion 0.99838 0.9993 0.99985 1.00000
#scree plot (explained variance)
fviz_screeplot(pca, addlabels = TRUE, barfill = "skyblue", barcolor = "skyblue3", linecolor = "skyblue4")
The variables factor map illustrates the projection of the 25 features onto the first two principal components (Dim1 and Dim2), which together account for 65.6% of total variance. The first dimension is mostly defined by indoor temperatures (T1-T9) and temperature outside. These vectors are long and align with the positive side of x axis. The vertical axis is dominated by the humidity sensors (RH) with the exception of the bathroom (RH_5) and of the outside (RH_6 and RH_out). Those vectors also point to the opposite direction relative to indoor temperatures, indicating the inverse relationship between outdoor humidity and indoor warmth. The vectors: log_Total_Energy, Visibility, Windspeed, Press_mm_hg and RH_5 are noticeably shorter, which suggests that the significant share of their variance is captured by higher dimension.
#vectors
fviz_pca_var(pca, col.var = "black")
To further understand the structure of the principal components, we analysed the contribution of each variable to the fist five PC. The dashed red lines represents the average contribution if all variables contributed equally - values above the line are considered significant for the given component.
Dimension 1 and 2 represent the house conditions, with dim1 mostly defined by indoor temperature and dimension 2 - indoor humidity levels. Dimension 3 captured the weather conditions such as windspeed, temperature and humidity outside. The fourth one is dominated by the total energy usage, pressure and humidity in the bathroom. The last dimension is very specific; the main contributing variable is Visibility.
#contribution of variables
p1 <- fviz_contrib(pca, choice = "var", axes = 1, fill = "skyblue")
p2 <- fviz_contrib(pca, choice = "var", axes = 2, fill = "salmon")
p3 <- fviz_contrib(pca, choice = "var", axes = 3, fill = "palegreen")
p4 <- fviz_contrib(pca, choice = "var", axes = 4, fill = "plum")
p5 <- fviz_contrib(pca, choice = "var", axes = 5, fill = "wheat")
#layout
p1 + p2 + p3 + p4 + p5 + plot_layout(ncol = 2)
The Quality of Representation plot shows the distribution of 19735 observations in the reduced 2-dimensional space. The shape of the cloud of points shows a distinct core with several offshoots, where observations differ from the average state. The color gradient represents the cos2 values, which indicates how much an individual point’s variance is captured by the first two dimensions. The points close to the center have a lower cos2 value, as their behavior is probably explained by higher dimensions. Red and orange points are well defined by the first two dimensions.
#cos2 plot
fviz_pca_ind(pca, col.ind = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
geom = "point", alpha.ind = 0.5)
For better interpretation of the principal components, a Varimax orthogonal rotation was applied. While the initial PCA maximizes the total variance explained by the first components, creating a complex structure, this procedure rotates the axes with an aim to achieve a simple and more interpretable structure.
Loadings represent the correlation between the original variable and the new component (RC). The cutoff set to 0.4 means showing only relevant correlations. The higher the value, the better a variable defines a RC. The Varimax rotation successfully achieved a simple structure allowing for a clear differentiation of the five components. RC1 (35.4% variance) acts as a temperature indicator, while RC2 (28%) explains the humidity levels. RC3 (9.2%) represents mostly outdoor conditions. RC4 (4.2%) predominantly captures energy level consumption and humidity in the bathroom, while RC5 positively correlates with visibility and the air pressure. The cumulative variance of 81.6% confirms that the 5-dimensional solution provides a accurate representation of the original feature space.
#varimax rotation
res_rot <- principal(data_scaled, nfactors = 5, rotate = "varimax")
#loadings
print(loadings(res_rot), digits = 3, cutoff = 0.4, sort = TRUE)
##
## Loadings:
## RC1 RC2 RC3 RC4 RC5
## T1 0.923
## T2 0.759 0.447
## T3 0.934
## T4 0.932
## T5 0.943
## T6 0.696 0.629
## RH_6 -0.735 0.447
## T7 0.942
## T8 0.897
## T9 0.967
## T_out 0.706 0.619
## RH_1 0.887
## RH_2 0.807
## RH_3 0.907
## RH_4 0.952
## RH_7 0.941
## RH_8 0.924
## RH_9 0.915
## Tdewpoint 0.609 0.636
## RH_out 0.467 -0.715
## Windspeed 0.614
## log_Total_Energy 0.723
## Visibility 0.851
## RH_5 0.486
## Press_mm_hg -0.493 0.449
##
## RC1 RC2 RC3 RC4 RC5
## SS loadings 8.851 7.007 2.307 1.198 1.038
## Proportion Var 0.354 0.280 0.092 0.048 0.042
## Cumulative Var 0.354 0.634 0.727 0.774 0.816
The final quality evaluation was performed by examining the metrics of uniqueness and complexity for variables, as presented on the plots below. The majority of variables representing the indoor temperature and conditions exhibit low unique values (<0.2). A notable outlier is bathroom humidity (RH_5), which has a uniqueness value of over 0.6. This result was expected as the bathroom moisture are driven by events specific to this room, such as showering, which are not captured by the first five components. The Uniqueness vs Complexity scatter plot confirms a successful achievement of simple structure with high concentration of variables within the low-complexity, log-uniqueness quadrant.
The next stage of the analysis involved Multidimensional scaling (MDS) to transform a data set into one with lower number of dimensions, keeping the distances between the points. While PCA focuses on the correlation between variables, MDS aims to represent the dissimilarity as distances between objects.
First, due to computational complexity of MDS, a sample of 1,000 observations was utilized. To determine the optimal number of dimensions, a Stress Plot was generated. The stress value metric represents the mismatch between the actual distances in the 25-dimensional space compared to the reduced MDS space. As shown in the plot, stress value drops as the number of dimensions increases. Following the Kruskal’s guidelines, stress values below 0.2 are considered acceptable. With 5 dimensions, as in previous PCA analysis, the model achieved a stress value of just below 0.2. This indicates that while the data structure is complex, the 5-dimensional solution provides a reliable representation of the similarities between pairs.
#Calculating the stress for dim
set.seed(123)
sample_idx <- sample(1:nrow(data_scaled), 1000)
data_sample <- data_scaled[sample_idx, ]
#distance matrix
d_sample <- dist(data_sample) # data sample
stress_values <- sapply(1:10, function(k) {
fit <- cmdscale(d_sample, k = k, eig = TRUE)
1 - fit$GOF[1] # stress = 1-goodness of fit
})
#Stress Plot
plot(1:10, stress_values, type = "b",
main = "MDS Stress Plot",
xlab = "Number of dimensions k",
ylab = "Stress",
pch = 19, col = "darkred")
abline(h = 0.2, col = "blue", lty = 2) # for d=5 the stress is below 20%
The choice of non-metric MDS (isoMDS) was motivated by the nature of the environmental dataset, as this approach is effective in identifying non-linear relationships and clusters within the 25-dimensional feature space. Rather than focusing on exact numerical distances, non-metric MDS utilizes rank-order of dissimilarities in order to compare similarities. The final calculated stress value for k=5 was 4.77%, which according to Kruskal’s scale indicates an excellent fit.
# non-metric MDS for k=5
mds_iso <- isoMDS(d_sample, k = 5)
## initial value 6.771090
## iter 5 value 5.147631
## iter 10 value 4.942035
## iter 15 value 4.828905
## iter 20 value 4.782320
## iter 20 value 4.779509
## iter 20 value 4.778591
## final value 4.778591
## converged
The Shepard’s plot visually represents how original dissimilarities vary from the distances obtained in the reduced MDS configuration. The points on the plot form a moderately narrow and defined line along the diagonal. Additionally, there are no significant non-linear distortions, suggesting that the five dimensions are sufficient to capture the data.
# Shepards plot
shep <- frustration <- Shepard(d_sample, mds_iso$points)
plot(shep, pch = ".", main = "Shepard Plot (k=5)",
xlab = "Actual distances", ylab = "MDS distances")
lines(shep$x, shep$y, type = "S", col = "red")
Scatter plot of the MDS first two dimensions provides a visual representation of observations. The majority of points form a blue mass in the center. This would represent the standard operation mode of the household, where humidity, temperature and energy follow routine patterns. It is important to note that MDS is less sensitive to the extreme values than previous PCA method. The points highlighted in red are statistical outliers. These points are on edges indicating states different from the norm.
plot(mds_iso$points[,1], mds_iso$points[,2],
col = rgb(0, 0, 1, 0.3), pch = 19,
main = "MDS visualization (Dim 1 & Dim 2)",
xlab = "MDS Dim 1", ylab = "MDS Dim 2")
outliers_mask <- mds_iso$points[,1] < -7
points(mds_iso$points[outliers_mask, 1], mds_iso$points[outliers_mask, 2], col = "red", pch = 19)
To see if the findings were consistent across different method, Procrustes analysis was performed, comparing the MDS configuration to the PCA scores. The Procrustes algorithm rotates and shifts the MDS map to find the best possible fit with the PCA results. Since the MDS was performed on a sample of 1000 observations, corresponding PCA scores were extracted for the same subset to allow for accurate comparison. The calculated sum of squares was 97.74, which indicates a moderate level of agreement between both methods. The plot below shows each point and arrows, which represent the movement required to align MDS with PCA. Most points are clustered in a dense central mass with short arrows, which suggests a high level of consensus between both models.
pca_scores <- as.matrix(res_rot$scores[sample_idx, 1:2])
mds_coords <- as.matrix(mds_iso$points[, 1:2])
fit_proc <- procrustes(pca_scores, mds_coords)
print(fit_proc)
##
## Call:
## procrustes(X = pca_scores, Y = mds_coords)
##
## Procrustes sum of squares:
## 97.74
plot(fit_proc, main = "Procrustes Errors: MDS vs PCA")
The Principal Component Analysis successfully reduced the 25-dimensional dataset into five components, which cumulatively account for 81.6% of the total variance. The application of Varimax rotation was key in achieving a simpler structure, allowing for the clear identification and interpretation of RCs: indoor temperature (RC1), humidity (RC2), outdoor conditions (RC3), energy demand (RC4), and visibility (RC5). An important takeaway from this analysis is the separation of energy consumption from the other RCs. This suggest that energy usage is a reflection of household behavior rather than a correlation with overall conditions. The high quality of representation for most variables and low complexity of the model confirms that the 5 component solution provide an accurate dimension reduction.
The Multidimensional Scaling analysis provided a robust map of the building’s operational modes. Achieving a satisfactory stress value of 4.78%, the 5-dimensional MDS Shepard’s plot confirmed goodness of fit of the model. The visualization revealed a moderately dense core while detecting outliers, as it is a good tool for anomaly detection. The Procrustes analysis (sum of squares = 97.74) confirmed a solid consensus between the linear projections of PCA and non-metric MDS distances. Using two methods in this analysis demonstrates that while household energy demand could be driven by external weather conditions and occupant behavior, it is possible to study this phenomenon on a reduced number of dimensions.