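Because PCA is sensitive to scale, the data are usually converted to z-scores so that every variable has mean 0 and standard deviation 1:
\(z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}\)
Here \(\bar{x}_j\) and \(s_j\) are the sample mean and sample standard deviation of variable \(j\).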
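The z-score formula can be checked by hand on a small matrix. The sketch below uses made-up values (not taken from breast-cancer.csv) and compares a manual computation with R's `scale()`, which centers and scales by the sample standard deviation by default.

```r
# Toy data: hypothetical values chosen only to show two very different scales
x <- cbind(area       = c(1001, 1326, 1203, 386),
           smoothness = c(0.118, 0.085, 0.110, 0.143))

# Manual z-score: subtract each column mean, divide by each column's sample SD
col_means <- colMeans(x)
col_sds   <- apply(x, 2, sd)
z_manual  <- sweep(sweep(x, 2, col_means, "-"), 2, col_sds, "/")

# scale() with its defaults (center = TRUE, scale = TRUE) does the same thing
z_scale <- scale(x)

print(all.equal(z_manual, z_scale, check.attributes = FALSE))
print(round(colMeans(z_manual), 10))  # column means are 0 up to rounding
print(apply(z_manual, 2, sd))         # column standard deviations are 1
```

`check.attributes = FALSE` is needed because `scale()` attaches the centering and scaling constants as attributes of its result.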
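Covariance matrix of the standardized data (equal to the correlation matrix of the original variables), where \(n\) is the number of observations:
\(\Sigma = \frac{1}{n-1} Z^T Z\)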
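A short check of this formula, assuming the built-in `USArrests` dataset as a stand-in for any all-numeric table: computing \(Z^T Z / (n-1)\) directly gives the same matrix as `cov()` on the standardized data, and that matrix is the correlation matrix of the unstandardized data.

```r
# Stand-in data (assumption: any all-numeric dataset behaves the same way)
X <- as.matrix(USArrests)
Z <- scale(X)        # standardized data
n <- nrow(Z)

# Covariance matrix from the formula: t(Z) %*% Z / (n - 1)
sigma_formula <- crossprod(Z) / (n - 1)

print(all.equal(sigma_formula, cov(Z), check.attributes = FALSE))  # same as cov()
print(all.equal(sigma_formula, cor(X), check.attributes = FALSE))  # same as cor() of raw data
print(diag(sigma_formula))                                         # every variance is 1
```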
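Perform the eigendecomposition of \(\Sigma\). Each eigenvector \(v\) gives the direction of a principal component, and its eigenvalue \(\lambda\) gives the variance along that direction:
\(\Sigma v = \lambda v\)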
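The defining relation can be verified numerically. This is a minimal sketch on the built-in `USArrests` data (a stand-in, not the dataset used in this analysis); it checks \(\Sigma V = V \Lambda\) for all eigenpairs at once, that the eigenvectors are orthonormal, and that the eigenvalues sum to the number of standardized variables.

```r
Z      <- scale(as.matrix(USArrests))  # stand-in standardized data
sigma  <- cov(Z)
e      <- eigen(sigma)
V      <- e$vectors                    # eigenvectors in columns
lambda <- e$values                     # eigenvalues in decreasing order

# Sigma %*% V equals V %*% diag(lambda): the eigen equation for every column
print(all.equal(sigma %*% V, V %*% diag(lambda), check.attributes = FALSE))

# Eigenvectors of a symmetric matrix are orthonormal: t(V) %*% V is the identity
print(all.equal(crossprod(V), diag(ncol(V))))

# The eigenvalues sum to the trace of sigma, i.e. the number of variables
print(c(sum_lambda = sum(lambda), n_variables = ncol(Z)))
```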
Compute the principal component scores by multiplying the standardized data \(Z\) by the eigenvector matrix \(V\), whose columns are the eigenvectors \(v\):
\(PC = Z \cdot V\)
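To make the score computation concrete, the sketch below projects standardized data onto the eigenvectors and compares the result with `prcomp()`. It again assumes `USArrests` as stand-in data. Eigenvectors are only defined up to sign, so the comparison uses absolute values.

```r
Z <- scale(as.matrix(USArrests))  # stand-in standardized data
e <- eigen(cov(Z))

# Scores: each row of Z expressed in the coordinates of the principal components
scores <- Z %*% e$vectors

# The variance of each score column equals the corresponding eigenvalue
print(all.equal(apply(scores, 2, var), e$values))

# Score columns are uncorrelated: off-diagonal covariances are 0 up to rounding
print(round(cov(scores), 10))

# prcomp() gives the same scores, possibly with some columns flipped in sign
p <- prcomp(USArrests, center = TRUE, scale. = TRUE)
print(all.equal(abs(scores), abs(p$x), check.attributes = FALSE))
```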
# install.packages("FactoMineR")
# install.packages("factoextra")
library(FactoMineR) # PCA analysis
library(factoextra) # PCA visualization
library(ggplot2)    # plotting
df <- read.csv("breast-cancer.csv") # read breast-cancer.csv
head(df)                            # show the first rows
str(df) # inspect the data: column types and number of features in the data frame
'data.frame': 569 obs. of 32 variables:
$ id : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
$ diagnosis : chr "M" "M" "M" "M" ...
$ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
$ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
$ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
$ area_mean : num 1001 1326 1203 386 1297 ...
$ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
$ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
$ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
$ concave.points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
$ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
$ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
$ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
$ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
$ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
$ area_se : num 153.4 74.1 94 27.2 94.4 ...
$ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
$ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
$ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
$ concave.points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
$ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
$ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
$ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
$ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
$ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
$ area_worst : num 2019 1956 1709 568 1575 ...
$ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
$ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
$ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
$ concave.points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
$ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
$ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
summary(df) # descriptive statistics for each feature
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean
Min. : 8670 Length:569 Min. : 6.981 Min. : 9.71 Min. : 43.79 Min. : 143.5 Min. :0.05263 Min. :0.01938
1st Qu.: 869218 Class :character 1st Qu.:11.700 1st Qu.:16.17 1st Qu.: 75.17 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492
Median : 906024 Mode :character Median :13.370 Median :18.84 Median : 86.24 Median : 551.1 Median :0.09587 Median :0.09263
Mean : 30371831 Mean :14.127 Mean :19.29 Mean : 91.97 Mean : 654.9 Mean :0.09636 Mean :0.10434
3rd Qu.: 8813129 3rd Qu.:15.780 3rd Qu.:21.80 3rd Qu.:104.10 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040
Max. :911320502 Max. :28.110 Max. :39.28 Max. :188.50 Max. :2501.0 Max. :0.16340 Max. :0.34540
concavity_mean concave.points_mean symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se
Min. :0.00000 Min. :0.00000 Min. :0.1060 Min. :0.04996 Min. :0.1115 Min. :0.3602 Min. : 0.757
1st Qu.:0.02956 1st Qu.:0.02031 1st Qu.:0.1619 1st Qu.:0.05770 1st Qu.:0.2324 1st Qu.:0.8339 1st Qu.: 1.606
Median :0.06154 Median :0.03350 Median :0.1792 Median :0.06154 Median :0.3242 Median :1.1080 Median : 2.287
Mean :0.08880 Mean :0.04892 Mean :0.1812 Mean :0.06280 Mean :0.4052 Mean :1.2169 Mean : 2.866
3rd Qu.:0.13070 3rd Qu.:0.07400 3rd Qu.:0.1957 3rd Qu.:0.06612 3rd Qu.:0.4789 3rd Qu.:1.4740 3rd Qu.: 3.357
Max. :0.42680 Max. :0.20120 Max. :0.3040 Max. :0.09744 Max. :2.8730 Max. :4.8850 Max. :21.980
area_se smoothness_se compactness_se concavity_se concave.points_se symmetry_se fractal_dimension_se
Min. : 6.802 Min. :0.001713 Min. :0.002252 Min. :0.00000 Min. :0.000000 Min. :0.007882 Min. :0.0008948
1st Qu.: 17.850 1st Qu.:0.005169 1st Qu.:0.013080 1st Qu.:0.01509 1st Qu.:0.007638 1st Qu.:0.015160 1st Qu.:0.0022480
Median : 24.530 Median :0.006380 Median :0.020450 Median :0.02589 Median :0.010930 Median :0.018730 Median :0.0031870
Mean : 40.337 Mean :0.007041 Mean :0.025478 Mean :0.03189 Mean :0.011796 Mean :0.020542 Mean :0.0037949
3rd Qu.: 45.190 3rd Qu.:0.008146 3rd Qu.:0.032450 3rd Qu.:0.04205 3rd Qu.:0.014710 3rd Qu.:0.023480 3rd Qu.:0.0045580
Max. :542.200 Max. :0.031130 Max. :0.135400 Max. :0.39600 Max. :0.052790 Max. :0.078950 Max. :0.0298400
radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave.points_worst
Min. : 7.93 Min. :12.02 Min. : 50.41 Min. : 185.2 Min. :0.07117 Min. :0.02729 Min. :0.0000 Min. :0.00000
1st Qu.:13.01 1st Qu.:21.08 1st Qu.: 84.11 1st Qu.: 515.3 1st Qu.:0.11660 1st Qu.:0.14720 1st Qu.:0.1145 1st Qu.:0.06493
Median :14.97 Median :25.41 Median : 97.66 Median : 686.5 Median :0.13130 Median :0.21190 Median :0.2267 Median :0.09993
Mean :16.27 Mean :25.68 Mean :107.26 Mean : 880.6 Mean :0.13237 Mean :0.25427 Mean :0.2722 Mean :0.11461
3rd Qu.:18.79 3rd Qu.:29.72 3rd Qu.:125.40 3rd Qu.:1084.0 3rd Qu.:0.14600 3rd Qu.:0.33910 3rd Qu.:0.3829 3rd Qu.:0.16140
Max. :36.04 Max. :49.54 Max. :251.20 Max. :4254.0 Max. :0.22260 Max. :1.05800 Max. :1.2520 Max. :0.29100
symmetry_worst fractal_dimension_worst
Min. :0.1565 Min. :0.05504
1st Qu.:0.2504 1st Qu.:0.07146
Median :0.2822 Median :0.08004
Mean :0.2901 Mean :0.08395
3rd Qu.:0.3179 3rd Qu.:0.09208
Max. :0.6638 Max. :0.20750
sum(is.na(df)) # count missing (NA) values
[1] 0
sum(duplicated(df)) # count duplicated rows
[1] 0
df_s <- subset(df, select = -c(id, diagnosis)) # keep every feature except id and diagnosis
df_labels <- df$diagnosis # store the diagnosis column separately as a character vector
Because PCA is sensitive to the scale of the data, we standardize every feature to mean 0 and standard deviation 1, so that features with large values (such as area) do not dominate features with small values (such as smoothness).
df_scaled <- scale(df_s) # standardize every feature except id and diagnosis; scale() returns a numeric matrix, not a data frame
After standardization each feature has variance 1 and therefore contributes equally to the total variance, so PCA is not biased toward large-scale features.
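The effect of skipping standardization can be seen on a small example. The sketch below assumes the built-in `USArrests` data as a stand-in, where one column (`Assault`) has a much larger variance than the others, and compares the first principal component with and without scaling.

```r
X <- as.matrix(USArrests)  # stand-in data with columns on very different scales

# Raw variances: Assault is far larger than the other columns
print(apply(X, 2, var))

e_raw <- eigen(cov(X))         # PCA on unscaled data
e_std <- eigen(cov(scale(X)))  # PCA on standardized data

# Share of total variance taken by the first component in each case
share_raw <- e_raw$values[1] / sum(e_raw$values)
share_std <- e_std$values[1] / sum(e_std$values)
print(c(raw = share_raw, standardized = share_std))

# Absolute loadings of the first component: without scaling it is
# essentially the large-variance column alone
loadings_pc1 <- rbind(raw          = abs(e_raw$vectors[, 1]),
                      standardized = abs(e_std$vectors[, 1]))
colnames(loadings_pc1) <- colnames(X)
print(round(loadings_pc1, 3))
```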
cov_matrix <- cov(df_scaled) # covariance of the standardized data, i.e. the correlation matrix of df_s
print("Covariance Matrix (5x5):")
print(cov_matrix[1:5, 1:5]) # drop the [1:5, 1:5] index to see the full matrix
Many features carry similar (redundant) information. This supports the use of PCA, because PCA can combine highly correlated variables into a single principal component.
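One way to make the redundancy visible is to list the feature pairs whose absolute correlation exceeds a cutoff. This is a sketch only: it uses the built-in `USArrests` data as a stand-in, and the cutoff of 0.6 is an arbitrary illustrative value, not one used in this analysis.

```r
X <- as.matrix(USArrests)  # stand-in data
R <- cor(X)                # correlation matrix
threshold <- 0.6           # hypothetical cutoff for "highly correlated"

# Look only at the upper triangle so each pair is reported once
idx <- which(abs(R) > threshold & upper.tri(R), arr.ind = TRUE)

high_pairs <- data.frame(
  feature_1   = rownames(R)[idx[, "row"]],
  feature_2   = colnames(R)[idx[, "col"]],
  correlation = R[idx],
  row.names   = NULL
)

# Strongest correlations first
high_pairs <- high_pairs[order(-abs(high_pairs$correlation)), ]
print(high_pairs)
```

For the standardized data in this analysis, the same steps could be applied to `cov_matrix` directly, since it is already a correlation matrix.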
# perform the eigendecomposition of the covariance matrix
eig <- eigen(cov_matrix)
# store the eigenvalues: the amount of variance
# explained by each principal component
eig_values <- eig$values
# store the eigenvectors: the directions of the principal components,
# used to form linear combinations of the original variables
eig_vectors <- eig$vectors
print("Eigenvalues (first 10):")
[1] "Eigenvalues (first 10):"
print(eig_values[1:10])
[1] 13.2816077 5.6913546 2.8179490 1.9806405 1.6487305 1.2073566 0.6752201 0.4766171 0.4168948 0.3506935
print(eig_vectors[1:5, 1:5])
[,1] [,2] [,3] [,4] [,5]
[1,] -0.2189024 -0.23385713 -0.008531243 0.04140896 0.03778635
[2,] -0.1037246 -0.05970609 0.064549903 -0.60305000 -0.04946885
[3,] -0.2275373 -0.21518136 -0.009314220 0.04198310 0.03737466
[4,] -0.2209950 -0.23107671 0.028699526 0.05343380 0.01033125
[5,] -0.1425897 0.18611302 -0.104291904 0.15938277 -0.36508853
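The matrix returned by `eigen()` has no row or column names, which is why the rows above print as `[1,]`, `[2,]`, and so on. Row \(i\) corresponds to the \(i\)-th column of the matrix that was decomposed. A minimal sketch of attaching labels, using the built-in `USArrests` data as a stand-in:

```r
Z <- scale(as.matrix(USArrests))  # stand-in standardized data
e <- eigen(cov(Z))

# Label rows with the feature names and columns with component names
loadings <- e$vectors
dimnames(loadings) <- list(colnames(Z),
                           paste0("PC", seq_len(ncol(loadings))))
print(round(loadings, 3))
```

In this analysis the analogous step would presumably be `rownames(eig_vectors) <- colnames(df_scaled)`. The sign of each column is arbitrary: multiplying an eigenvector by -1 gives an equally valid eigenvector, so only the relative signs within a column carry meaning.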
Explained variance ratio = the proportion of the total variance explained by each PC.
explained_var_ratio <- eig_values / sum(eig_values)
print("Explained variance ratio (first 10):")
[1] "Explained variance ratio (first 10):"
print(explained_var_ratio[1:10])
[1] 0.44272026 0.18971182 0.09393163 0.06602135 0.05495768 0.04024522 0.02250734 0.01588724 0.01389649 0.01168978
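As a worked check on the first two values: the standardized data have 30 features, each with variance 1, so the eigenvalues sum to 30 (the trace of \(\Sigma\)). The ratio for the first component is therefore
\(\frac{\lambda_1}{\sum_{j=1}^{30} \lambda_j} = \frac{13.2816077}{30} \approx 0.4427\)
and for the second component
\(\frac{\lambda_2}{\sum_{j=1}^{30} \lambda_j} = \frac{5.6913546}{30} \approx 0.1897\)
Both agree with the printed output.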
Cumulative variance = the total proportion of variance explained by all PCs up to and including a given one.
cumulative_var <- cumsum(explained_var_ratio)
print("Cumulative variance (first 10):")
[1] "Cumulative variance (first 10):"
print(cumulative_var[1:10])
[1] 0.4427203 0.6324321 0.7263637 0.7923851 0.8473427 0.8875880 0.9100953 0.9259825 0.9398790 0.9515688
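The cumulative values can be turned into a component count by picking the first PC at which a target share of variance is reached. The sketch below copies the ten printed values. The helper name `n_components_for` and the 90% and 95% targets are illustrative assumptions, not part of the original analysis.

```r
# First ten cumulative values, copied from the printed output
cumulative_var <- c(0.4427203, 0.6324321, 0.7263637, 0.7923851, 0.8473427,
                    0.8875880, 0.9100953, 0.9259825, 0.9398790, 0.9515688)

# Hypothetical helper: smallest number of components whose cumulative
# explained variance reaches the target (NA if the target is never reached)
n_components_for <- function(cum_var, target) {
  which(cum_var >= target)[1]
}

k90 <- n_components_for(cumulative_var, 0.90)
k95 <- n_components_for(cumulative_var, 0.95)
print(c(k90 = k90, k95 = k95))  # 7 and 10 for the values above
```

With these values, 7 components cover at least 90% of the variance and 10 components cover at least 95%, compared with 30 original features.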
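A scree plot is a graph used in principal component analysis (PCA) to help decide how many principal components to retain. The plot below shows the cumulative explained variance rather than the individual eigenvalues, and it serves the same purpose: retain components up to the point where the curve flattens.
# qplot() is deprecated since ggplot2 3.4.0, so the plot is built with ggplot()
ggplot(data.frame(n_components = seq_along(cumulative_var), cumulative_var = cumulative_var),
       aes(x = n_components, y = cumulative_var)) +
  geom_line() +
  geom_point() +
  labs(x = "Number of Components", y = "Cumulative Explained Variance")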
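A scree plot in the strict sense plots the eigenvalues themselves against the component number, and is usually read by looking for the "elbow" where the curve levels off. A minimal sketch with ggplot2, using only the ten eigenvalues printed earlier (the data frame name `scree_df` is an illustrative choice):

```r
library(ggplot2)

# First ten eigenvalues, copied from the printed output
eig_values <- c(13.2816077, 5.6913546, 2.8179490, 1.9806405, 1.6487305,
                1.2073566, 0.6752201, 0.4766171, 0.4168948, 0.3506935)

scree_df <- data.frame(component  = seq_along(eig_values),
                       eigenvalue = eig_values)

# Eigenvalue per component, joined by a line so the elbow is easy to spot
p <- ggplot(scree_df, aes(x = component, y = eigenvalue)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = scree_df$component) +
  labs(x = "Principal component", y = "Eigenvalue")
print(p)
```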