if(!require(psych)) install.packages("psych")
## Loading required package: psych
if(!require(FactoMineR)) install.packages("FactoMineR")
## Loading required package: FactoMineR
if(!require(factoextra)) install.packages("factoextra")
## Loading required package: factoextra
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
if(!require(dplyr)) install.packages("dplyr")
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
if(!require(ucimlrepo)) install.packages("ucimlrepo")
## Loading required package: ucimlrepo
library(psych) # MSA, PCA, FA
library(dplyr) # wrangling
library(FactoMineR) #PCA
library(factoextra) #visualization
library(ucimlrepo) #data
# fetch dataset
room_occupancy_estimation = fetch_ucirepo(id=864)
# data (as pandas dataframes)
X = room_occupancy_estimation$data$features
y = room_occupancy_estimation$data$targets
# metadata
print(room_occupancy_estimation$metadata)
## $uci_id
## [1] 864
##
## $name
## [1] "Room Occupancy Estimation"
##
## $repository_url
## [1] "https://archive.ics.uci.edu/dataset/864/room+occupancy+estimation"
##
## $data_url
## [1] "https://archive.ics.uci.edu/static/public/864/data.csv"
##
## $abstract
## [1] "Data set for estimating the precise number of occupants in a room using multiple non-intrusive environmental sensors like temperature, light, sound, CO2 and PIR."
##
## $area
## [1] "Computer Science"
##
## $tasks
## $tasks[[1]]
## [1] "Classification"
##
##
## $characteristics
## $characteristics[[1]]
## [1] "Multivariate"
##
## $characteristics[[2]]
## [1] "Time-Series"
##
##
## $num_instances
## [1] 10129
##
## $num_features
## [1] 18
##
## $feature_types
## $feature_types[[1]]
## [1] "Real"
##
##
## $demographics
## list()
##
## $target_col
## $target_col[[1]]
## [1] "Room_Occupancy_Count"
##
##
## $index_col
## NULL
##
## $has_missing_values
## [1] "no"
##
## $missing_values_symbol
## NULL
##
## $year_of_dataset_creation
## [1] 2018
##
## $last_updated
## [1] "Wed Aug 16 2023"
##
## $dataset_doi
## [1] "10.24432/C5P605"
##
## $creators
## $creators[[1]]
## [1] "Adarsh Pal Singh"
##
## $creators[[2]]
## [1] "Sachin Chaudhari"
##
##
## $intro_paper
## $intro_paper$ID
## [1] 275
##
## $intro_paper$type
## [1] "NATIVE"
##
## $intro_paper$title
## [1] "Machine Learning-Based Occupancy Estimation Using Multivariate Sensor Nodes"
##
## $intro_paper$authors
## [1] "A. Singh, Vivek Jain, S. Chaudhari, F. Kraemer, S. Werner, V. Garg"
##
## $intro_paper$venue
## [1] "2018 IEEE Globecom Workshops (GC Wkshps)"
##
## $intro_paper$year
## [1] 2018
##
## $intro_paper$journal
## NULL
##
## $intro_paper$DOI
## NULL
##
## $intro_paper$URL
## [1] "https://www.semanticscholar.org/paper/e631ea26f0fd88541f42b4e049d63d6b52d6d3ac"
##
## $intro_paper$sha
## NULL
##
## $intro_paper$corpus
## NULL
##
## $intro_paper$arxiv
## NULL
##
## $intro_paper$mag
## NULL
##
## $intro_paper$acl
## NULL
##
## $intro_paper$pmid
## NULL
##
## $intro_paper$pmcid
## NULL
##
##
## $additional_info
## $additional_info$summary
## [1] "The experimental testbed for occupancy estimation was deployed in a 6m x 4.6m room. The setup consisted of 7 sensor nodes and one edge node in a star configuration with the sensor nodes transmitting data to the edge every 30s using wireless transceivers. No HVAC systems were in use while the dataset was being collected.\n\nFive different types of non-intrusive sensors were used in this experiment: temperature, light, sound, CO2 and digital passive infrared (PIR). The CO2, sound and PIR sensors needed manual calibration. For the CO2 sensor, zero-point calibration was manually done before its first use by keeping it in a clean environment for over 20 minutes and then pulling the calibration pin (HD pin) low for over 7s. The sound sensor is essentially a microphone with a variable-gain analog amplifier attached to it. Therefore, the output of this sensor is analog which is read by the microcontroller’s ADC in volts. The potentiometer tied to the gain of the amplifier was adjusted to ensure the highest sensitivity. The PIR sensor has two trimpots: one to tweak the sensitivity and the other to tweak the time for which the output stays high after detecting motion. Both of these were adjusted to the highest values. Sensor nodes S1-S4 consisted of temperature, light and sound sensors, S5 had a CO2 sensor and S6 and S7 had one PIR sensor each that were deployed on the ceiling ledges at an angle that maximized the sensor’s field of view for motion detection.\n\nThe data was collected for a period of 4 days in a controlled manner with the occupancy in the room varying between 0 and 3 people. The ground truth of the occupancy count in the room was noted manually.\n\nPlease refer to our publications for more details."
##
## $additional_info$purpose
## NULL
##
## $additional_info$funded_by
## NULL
##
## $additional_info$instances_represent
## NULL
##
## $additional_info$recommended_data_splits
## NULL
##
## $additional_info$sensitive_data
## NULL
##
## $additional_info$preprocessing_description
## NULL
##
## $additional_info$variable_info
## [1] "Date: YYYY/MM/DD\nTime: HH:MM:SS\nTemperature: In degree Celsius\nLight: In Lux\nSound: In Volts (amplifier output read by ADC)\nCO2: In PPM\nCO2 Slope: Slope of CO2 values taken in a sliding window\nPIR: Binary value conveying motion detection\nRoom_Occupancy_Count: Ground Truth"
##
## $additional_info$citation
## [1] "If you use this dataset in your research, please cite the following paper:\nAdarsh Pal Singh, Vivek Jain, Sachin Chaudhari, Frank Alexander Kraemer, Stefan Werner and Vishal Garg, \"Machine Learning-Based Occupancy Estimation Using Multivariate Sensor Nodes,\" in 2018 IEEE Globecom Workshops (GC Wkshps), 2018."
# variable information
print(room_occupancy_estimation$variables)
## name role type demographic
## 1 Date Feature Date NA
## 2 Time Feature Date NA
## 3 S1_Temp Feature Continuous NA
## 4 S2_Temp Feature Continuous NA
## 5 S3_Temp Feature Continuous NA
## 6 S4_Temp Feature Continuous NA
## 7 S1_Light Feature Integer NA
## 8 S2_Light Feature Integer NA
## 9 S3_Light Feature Integer NA
## 10 S4_Light Feature Integer NA
## 11 S1_Sound Feature Continuous NA
## 12 S2_Sound Feature Continuous NA
## 13 S3_Sound Feature Continuous NA
## 14 S4_Sound Feature Continuous NA
## 15 S5_CO2 Feature Integer NA
## 16 S5_CO2_Slope Feature Continuous NA
## 17 S6_PIR Feature Binary NA
## 18 S7_PIR Feature Integer NA
## 19 Room_Occupancy_Count Target Integer NA
## description units missing_values
## 1 <NA> YYYY/MM/DD no
## 2 <NA> HH:MM:SS no
## 3 <NA> C no
## 4 <NA> C no
## 5 <NA> C no
## 6 <NA> C no
## 7 <NA> Lux no
## 8 <NA> Lux no
## 9 <NA> Lux no
## 10 <NA> Lux no
## 11 amplifier output read by ADC Volts no
## 12 amplifier output read by ADC Volts no
## 13 amplifier output read by ADC Volts no
## 14 amplifier output read by ADC Volts no
## 15 <NA> PPM no
## 16 Slope of CO2 values taken in a sliding window <NA> no
## 17 Binary value conveying motion detection <NA> no
## 18 Binary value conveying motion detection <NA> no
## 19 Ground Truth <NA> no
data = X |>
select(-Date, -Time, -S6_PIR, -S7_PIR)
str(data)
## 'data.frame': 10129 obs. of 14 variables:
## $ S1_Temp : num 24.9 24.9 25 25 25 ...
## $ S2_Temp : num 24.8 24.8 24.8 24.8 24.8 ...
## $ S3_Temp : num 24.6 24.6 24.5 24.6 24.6 ...
## $ S4_Temp : num 25.4 25.4 25.4 25.4 25.4 ...
## $ S1_Light : int 121 121 121 121 121 121 120 121 122 101 ...
## $ S2_Light : int 34 33 34 34 34 34 34 34 35 34 ...
## $ S3_Light : int 53 53 53 53 54 54 54 54 56 57 ...
## $ S4_Light : int 40 40 40 40 40 40 40 41 43 43 ...
## $ S1_Sound : num 0.08 0.93 0.43 0.41 0.18 0.13 1.39 0.09 0.09 3.84 ...
## $ S2_Sound : num 0.19 0.05 0.11 0.1 0.06 0.06 0.32 0.06 0.05 0.64 ...
## $ S3_Sound : num 0.06 0.06 0.08 0.1 0.06 0.06 0.43 0.09 0.06 0.48 ...
## $ S4_Sound : num 0.06 0.06 0.06 0.09 0.06 0.07 0.06 0.05 0.13 0.39 ...
## $ S5_CO2 : int 390 390 390 390 390 390 390 390 390 390 ...
## $ S5_CO2_Slope: num 0.769 0.646 0.519 0.388 0.254 ...
head(data)
## S1_Temp S2_Temp S3_Temp S4_Temp S1_Light S2_Light S3_Light S4_Light S1_Sound
## 1 24.94 24.75 24.56 25.38 121 34 53 40 0.08
## 2 24.94 24.75 24.56 25.44 121 33 53 40 0.93
## 3 25.00 24.75 24.50 25.44 121 34 53 40 0.43
## 4 25.00 24.75 24.56 25.44 121 34 53 40 0.41
## 5 25.00 24.75 24.56 25.44 121 34 54 40 0.18
## 6 25.00 24.81 24.56 25.44 121 34 54 40 0.13
## S2_Sound S3_Sound S4_Sound S5_CO2 S5_CO2_Slope
## 1 0.19 0.06 0.06 390 0.7692308
## 2 0.05 0.06 0.06 390 0.6461538
## 3 0.11 0.08 0.06 390 0.5192308
## 4 0.10 0.10 0.09 390 0.3884615
## 5 0.06 0.06 0.06 390 0.2538462
## 6 0.06 0.06 0.07 390 0.1653846
sum(is.na(data))
## [1] 0
cor_matrix = cor(data)
corrplot::corrplot(cor_matrix, tl.col = "black", tl.srt = 45, tl.cex = 0.5)
Pada korelasi matrix ini sensor suhu, cahaya dan S5_CO2 memiliki multikorelasi. Olehkarena itu mendukung kebutuhan reduksi dengan PCA
r <- cor_matrix
KMO(r)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = r)
## Overall MSA = 0.84
## MSA for each item =
## S1_Temp S2_Temp S3_Temp S4_Temp S1_Light S2_Light
## 0.76 0.90 0.83 0.91 0.80 0.83
## S3_Light S4_Light S1_Sound S2_Sound S3_Sound S4_Sound
## 0.86 0.61 0.94 0.88 0.88 0.89
## S5_CO2 S5_CO2_Slope
## 0.86 0.78
Karena semua nilai MSA keseluruhan dan tiap variabel lebih dari 0.5, maka tidak ada yang perlu dihapus
#Bartlett Test
bartlett.test(data)
##
## Bartlett test of homogeneity of variances
##
## data: data
## Bartlett's K-squared = 925428, df = 13, p-value < 2.2e-16
scale_data = scale(data)
pca_result <- PCA(scale_data,
scale.unit = TRUE,
graph = FALSE,
ncp=ncol(data))
pca_result$eig
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 7.50127979 53.5805699 53.58057
## comp 2 1.92016225 13.7154446 67.29601
## comp 3 1.18641170 8.4743693 75.77038
## comp 4 0.75975810 5.4268436 81.19723
## comp 5 0.56778550 4.0556107 85.25284
## comp 6 0.47300854 3.3786324 88.63147
## comp 7 0.45735903 3.2668502 91.89832
## comp 8 0.32522230 2.3230165 94.22134
## comp 9 0.27120498 1.9371784 96.15852
## comp 10 0.22139950 1.5814250 97.73994
## comp 11 0.12680463 0.9057474 98.64569
## comp 12 0.09707522 0.6933944 99.33908
## comp 13 0.07104183 0.5074417 99.84652
## comp 14 0.02148663 0.1534759 100.00000
Pada data diatas menunjukan eigen value, dari 14 PC. Karena kami akan memilih PC berdasarkan Kaiser Rule maka kedepannya kami akan menggunakan PC1, PC2, dan PC3.
Dengan demikian dengan hanya menggunakan 3PC ini sudah bisa mewakili dari 75,5% data secara keseluruhan
head(pca_result$svd$V)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.3130533 -0.3216278 -0.11255799 -0.11723609 -0.04419216 0.12711820
## [2,] 0.3011914 -0.1958193 0.01702059 -0.02571437 -0.13866143 -0.03178511
## [3,] 0.3009785 -0.3550154 -0.05542094 0.01004438 0.05802103 0.23565225
## [4,] 0.2807664 -0.3226108 0.04545622 0.13964287 0.08781231 0.48349652
## [5,] 0.3192166 0.1094353 0.23449429 -0.14001930 -0.08303463 -0.19148369
## [6,] 0.2928273 0.1500405 0.24745281 -0.21800496 -0.25758701 -0.31064021
## [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] 0.05925640 -0.09072584 -0.06084163 -0.00272259 -0.2082511 0.2196291
## [2,] -0.15057599 0.67542419 0.45551345 0.06698789 -0.3118697 -0.2289384
## [3,] 0.03495946 -0.09926642 -0.03887623 0.09472554 0.2104149 0.1214650
## [4,] 0.14060151 -0.08760954 -0.06303906 -0.39615228 0.1757092 -0.2528825
## [5,] 0.09068063 -0.15348391 -0.24450427 -0.34325580 -0.6048077 0.2965957
## [6,] -0.26334675 0.14782542 -0.15223787 -0.40959429 0.5328956 -0.1102574
## [,13] [,14]
## [1,] -0.23035913 0.77065247
## [2,] 0.02038820 -0.09385233
## [3,] -0.62876531 -0.49534933
## [4,] 0.51817709 -0.04858680
## [5,] 0.08283273 -0.31298741
## [6,] -0.15153988 0.13742686
Pada Eigen Vector ini menunjukan vektor dari PC dimana yang akan kita gunakan adalah kolom 1, kolom 2 dan kolom3
head(pca_result$ind['coord']$coord[, 1:3])
## Dim.1 Dim.2 Dim.3
## 1 -0.6814220 2.109987 1.954008
## 2 -0.1251871 2.534918 1.534751
## 3 -0.4450482 2.163628 1.729571
## 4 -0.3771934 2.129176 1.590179
## 5 -0.6779797 1.749557 1.867365
## 6 -0.6785629 1.683586 1.845284
Pada perhitungan di atas berisi data baru yang telah direduksi dengan nilai eigen vector yang sudah dikalikan antara (X1 * e11 + X2 * e12 + X3 * e13 + dst)
head(pca_result$var$coord[, 1:3])
## Dim.1 Dim.2 Dim.3
## S1_Temp 0.8574048 -0.4456794 -0.12260101
## S2_Temp 0.8249170 -0.2713467 0.01853926
## S3_Temp 0.8243340 -0.4919445 -0.06036589
## S4_Temp 0.7689761 -0.4470415 0.04951207
## S1_Light 0.8742853 0.1516443 0.25541710
## S2_Light 0.8020089 0.2079110 0.26953185
Berdasarkan analisis korelasi antara PC dan Fiturnya menunjukan bahwa PC1 memiliki korelasi yang kuat antara (S1_Temp, S2_Temp, S3_Temp, S4_Temp, S1_Light, S2_Light, S3_Light, S1_Sound, S2_Sound, S3_Sound, S4_Sound, S5_CO2), pada PC2 memiliki korelasi kuat pada S5_CO2_Slope dan pada PC3 memiliki korelasi kuat pada S4_Light
fviz_eig(pca_result,
addlabels = TRUE,
ncp = ncol(data),
barfill = "skyblue",
barcolor = "darkblue",
linecolor = "red",
)
## Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
## Ignoring empty aesthetic: `width`.
Berdasarkan screeplot, sebenarnya hanya dengan menggunakan 2 PC sudah dapat mewakili 67,3% dari keseluruhan data. Namun karena kita menggunakan Kaiser Rule maka kita akan menggunakan 3 PC berdasarkan Eigen Value yang lebih dari 1 pada penjelasan sebelumnya
#Biplot
fviz_pca_biplot(pca_result,
geom.ind = "point",
#col.ind = status.ipm,
#palette = c("#FC4E07","#E7B800", "#00AFBB"),
addEllipses = TRUE,
#legend.title = "Kategori"
)
Persebaran data 2D pada PC1 dan PC2 menunjukan bahwa terlihat arah vectornya secara keseluruhan cenderung ada di PC1 mendatar ke kanan. Sedangkan hanya vector dari S5_CO2_Slope cenderung naik keatas mengarah ke PC2.
Pada S4_Light disini kita melihat vectornya masih tercampur dengan PC1, karena S4_Light baru bisa terlihat dengan jelas di PC3 bukan di visualisasi 2D dari PC1 dan PC2 ini
# correlation circle
contrib_circle <- fviz_pca_var(pca_result, col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE) +
ggtitle("Kontribusi Variabel")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
plot(contrib_circle)
Pada visualisasi 2D dari kontribusi tiap variabel, Sensor suhu paling memiliki kontribusi yang paling besar sekitar 10 yang diwakili dengan warna merah dan S4_Light memiliki kontribusi paling rendah sekitar 3 diwakili dengan warna biru muda
#contribution
contrib_v_PC1 <- fviz_contrib(pca_result, choice = "var", axes = 1, top = 5) + ggtitle("PC1")
plot(contrib_v_PC1)
contrib_v_PC2 <- fviz_contrib(pca_result, choice = "var", axes = 2, top = 5) + ggtitle("PC2")
plot(contrib_v_PC2)
contrib_v_PC3 <- fviz_contrib(pca_result, choice = "var", axes = 3, top = 5) + ggtitle("PC3")
plot(contrib_v_PC3)
Pada PC1 fitur yang paling berpengaruh ada S1_Light, S3_Light, S1_Temp, S2_Temp, dan S3_Temp
Pada PC2 fitur yang paling berpengaruh ada S5_CO2_Slope, S3_Temp, S4_Temp, S1_Temp, dan S4_Sound
Pada PC3 fitur yang paling berpengaruh adalah S4_Light, S4_Sound, dan S3_Sound
#FA
fa <- fa(cor(scale_data), nfactors = 3, rotate = "oblimin")
## Loading required namespace: GPArotation
fa
## Factor Analysis using method = minres
## Call: fa(r = cor(scale_data), nfactors = 3, rotate = "oblimin")
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR1 MR2 MR3 h2 u2 com
## S1_Temp 0.97 -0.02 0.07 0.98 0.017 1.0
## S2_Temp 0.65 0.24 0.05 0.70 0.300 1.3
## S3_Temp 0.99 -0.01 -0.02 0.95 0.049 1.0
## S4_Temp 0.81 0.13 -0.08 0.73 0.269 1.1
## S1_Light 0.16 0.83 0.04 0.90 0.101 1.1
## S2_Light 0.07 0.79 0.06 0.76 0.244 1.0
## S3_Light 0.24 0.66 0.09 0.77 0.229 1.3
## S4_Light -0.03 0.69 -0.15 0.36 0.643 1.1
## S1_Sound 0.04 0.26 0.51 0.51 0.487 1.5
## S2_Sound -0.03 0.28 0.54 0.51 0.488 1.5
## S3_Sound 0.11 -0.02 0.78 0.67 0.334 1.0
## S4_Sound -0.01 -0.04 0.87 0.71 0.286 1.0
## S5_CO2 0.82 -0.01 0.10 0.75 0.250 1.0
## S5_CO2_Slope -0.33 0.61 0.20 0.38 0.625 1.8
##
## MR1 MR2 MR3
## SS loadings 4.15 3.22 2.30
## Proportion Var 0.30 0.23 0.16
## Cumulative Var 0.30 0.53 0.69
## Proportion Explained 0.43 0.33 0.24
## Cumulative Proportion 0.43 0.76 1.00
##
## With factor correlations of
## MR1 MR2 MR3
## MR1 1.00 0.57 0.42
## MR2 0.57 1.00 0.57
## MR3 0.42 0.57 1.00
##
## Mean item complexity = 1.2
## Test of the hypothesis that 3 factors are sufficient.
##
## df null model = 91 with the objective function = 14.35
## df of the model are 52 and the objective function was 2.17
##
## The root mean square of the residuals (RMSR) is 0.03
## The df corrected root mean square of the residuals is 0.03
##
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## MR1 MR2 MR3
## Correlation of (regression) scores with factors 0.95 0.82 0.85
## Multiple R square of scores with factors 0.91 0.68 0.72
## Minimum correlation of possible factor scores 0.82 0.36 0.43
Pada tabel FA ini kontribusi tiap variabel terhadap MR 1, 2, dan 3 semakin terlihat lebih jelas jika dibandingkan dengan korelasi fitur dengan PC hasil PCA sebelumnya. Dimana MR 1 lebih dipengaruhi oleh variabel dari faktor temperatur dan kadar CO2. MR2 dipengaruhi oleh faktor cahaya dan laju perubahan CO2. MR3 dipengaruhi oleh faktor tingkat kebisingan
fa.diagram(fa, cex = 0.7, marg = c(0.5, 4, 1.5, 0.5))
Pada visualisasi ini dapat terlihat dengan jelas bahwa penyusun dari
MR1 = 1 * S3_Temp + 1 * S1_Temp + 0.8 * S5_CO2 + 0.8 * S4_Temp + 0.6 * S2_Temp
MR2 = 0.8 * S1_Light + 0.8 * S2_Light + 0.7 * S4_Light + 0.7 * S3_Light + 0.6 * S5_CO2_Slope
MR3 = 0.9 * S4_Sound + 0.8 * S3_Sound + 0.5 * S2_Sound + 0.5 * S1_Sound