Module 1: Implementing Principal Component Analysis (PCA) and Factor Analysis (FA) on Liver Function Parameters Using the HCV Dataset

hcv <- read.csv("hcvdata.csv")

The HCV dataset is loaded into R. It contains clinical information for each patient, including laboratory results and a diagnosis category, and is the basis for the analysis that follows.

dim(hcv)

## [1] 615  14

str(hcv)

## 'data.frame':    615 obs. of  14 variables:
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Category: chr  "0=Blood Donor" "0=Blood Donor" "0=Blood Donor" "0=Blood Donor" ...
##  $ Age     : int  32 32 32 32 32 32 32 32 32 32 ...
##  $ Sex     : chr  "m" "m" "m" "m" ...
##  $ ALB     : num  38.5 38.5 46.9 43.2 39.2 41.6 46.3 42.2 50.9 42.4 ...
##  $ ALP     : num  52.5 70.3 74.7 52 74.1 43.3 41.3 41.9 65.5 86.3 ...
##  $ ALT     : num  7.7 18 36.2 30.6 32.6 18.5 17.5 35.8 23.2 20.3 ...
##  $ AST     : num  22.1 24.7 52.6 22.6 24.8 19.7 17.8 31.1 21.2 20 ...
##  $ BIL     : num  7.5 3.9 6.1 18.9 9.6 12.3 8.5 16.1 6.9 35.2 ...
##  $ CHE     : num  6.93 11.17 8.84 7.33 9.15 ...
##  $ CHOL    : num  3.23 4.8 5.2 4.74 4.32 6.05 4.79 4.6 4.1 4.45 ...
##  $ CREA    : num  106 74 86 80 76 111 70 109 83 81 ...
##  $ GGT     : num  12.1 15.6 33.2 33.8 29.9 91 16.9 21.5 13.7 15.9 ...
##  $ PROT    : num  69 76.5 79.3 75.7 68.7 74 74.5 67.1 71.3 69.9 ...

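dim() and str() report the number of rows and columns and the type of each variable: 615 observations of 14 variables, of which Category and Sex are character columns and the rest are numeric (integer or double). Checking the structure first shows which columns can enter the analysis directly.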
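
str() does not report missing values, which become relevant once correlations are computed. A minimal sketch of a per-column count follows; the data frame toy and its values are made up for illustration and are not taken from the HCV file.

```r
# Hypothetical toy data with a few missing laboratory values
toy <- data.frame(
  ALB = c(38.5, NA, 46.9, 43.2),
  ALP = c(52.5, 70.3, NA, NA),
  AST = c(22.1, 24.7, 52.6, 22.6)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Number of rows without any missing value
complete_rows <- sum(complete.cases(toy))
complete_rows
```

On the real data the same idea would be colSums(is.na(hcv)).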

hcv$X <- NULL

Column X is dropped because it is only the row index written into the CSV file (1, 2, 3, ...) and carries no analytical information.

data_num <- hcv[, sapply(hcv, is.numeric)]

The numeric variables are separated from the full dataset because PCA and factor analysis both operate on numeric data. The result, data_num, holds 11 columns: Age and the ten laboratory measurements.
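
sapply() works for this selection, but its return type depends on the input. A slightly more defensive sketch uses vapply(), which is guaranteed to return a logical vector, together with drop = FALSE so the result stays a data frame. The data frame toy is made up for illustration.

```r
# Hypothetical toy data mixing character and numeric columns
toy <- data.frame(
  Category = c("0=Blood Donor", "1=Hepatitis", "3=Cirrhosis"),
  Age = c(32L, 45L, 59L),
  ALB = c(38.5, 41.0, 33.2),
  stringsAsFactors = FALSE
)

# TRUE for every column that is integer or double
is_num <- vapply(toy, is.numeric, logical(1))

# Keep only the numeric columns; drop = FALSE preserves the data frame
toy_num <- toy[, is_num, drop = FALSE]
names(toy_num)
```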

desc <- data.frame(
  Mean = sapply(data_num, mean, na.rm=TRUE),
  SD = sapply(data_num, sd, na.rm=TRUE),
  Min = sapply(data_num, min, na.rm=TRUE),
  Max = sapply(data_num, max, na.rm=TRUE)
)

desc

##           Mean        SD   Min     Max
## Age  47.408130 10.055105 19.00   77.00
## ALB  41.620195  5.780629 14.90   82.20
## ALP  68.283920 26.028315 11.30  416.60
## ALT  28.450814 25.469689  0.90  325.30
## AST  34.786341 33.090690 10.60  324.00
## BIL  11.396748 19.673150  0.80  254.00
## CHE   8.196634  2.205657  1.42   16.41
## CHOL  5.368099  1.132728  1.43    9.67
## CREA 81.287805 49.756166  8.00 1079.10
## GGT  39.533171 54.661071  4.50  650.90
## PROT 72.044137  5.402636 44.80   90.00

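Descriptive statistics (mean, standard deviation, minimum, and maximum) are computed for every numeric variable to summarise location and spread. Several markers have maxima far above their means, for example CREA (mean 81.3, maximum 1079.1) and GGT (mean 39.5, maximum 650.9), which points to right-skewed distributions with extreme values.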
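
For skewed variables the mean and standard deviation are pulled toward extreme values, so the median and interquartile range can be a useful complement. The sketch below follows the same sapply() pattern on made-up values; toy and robust_desc are illustrative names.

```r
# Hypothetical values: one extreme CREA reading and one missing CHE reading
toy <- data.frame(
  CREA = c(70, 74, 76, 80, 81, 1079.1),
  CHE = c(6.9, 7.3, 8.8, 9.2, 11.2, NA)
)

# Median and interquartile range per column, ignoring missing values
robust_desc <- data.frame(
  Median = sapply(toy, median, na.rm = TRUE),
  IQR = sapply(toy, IQR, na.rm = TRUE)
)
robust_desc
```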

table(hcv$Category)

## 
##          0=Blood Donor 0s=suspect Blood Donor            1=Hepatitis 
##                    533                      7                     24 
##             2=Fibrosis            3=Cirrhosis 
##                     21                     30

prop.table(table(hcv$Category))

## 
##          0=Blood Donor 0s=suspect Blood Donor            1=Hepatitis 
##             0.86666667             0.01138211             0.03902439 
##             2=Fibrosis            3=Cirrhosis 
##             0.03414634             0.04878049

table(hcv$Sex)

## 
##   f   m 
## 238 377

prop.table(table(hcv$Sex))

## 
##         f         m 
## 0.3869919 0.6130081

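The categorical variables are summarised with frequencies and proportions. The diagnosis groups are strongly imbalanced: 533 of 615 records (86.7%) are blood donors, while hepatitis, fibrosis, and cirrhosis together account for 75 records (12.2%). By sex, 377 records (61.3%) are male and 238 (38.7%) are female.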
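
Counts and percentages can also be placed side by side in a single table. A small sketch with a made-up vector (sex and summary_tab are illustrative names):

```r
# Hypothetical categorical variable
sex <- c("m", "f", "m", "m", "f")

counts <- table(sex)

# One row per level with its count and percentage
summary_tab <- data.frame(
  Level = names(counts),
  N = as.vector(counts),
  Percent = round(100 * as.vector(prop.table(counts)), 1)
)
summary_tab
```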

# Install the reshaping and plotting packages only if they are not already available
if (!requireNamespace("reshape2", quietly = TRUE)) install.packages("reshape2")
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2", dependencies = TRUE)

library(reshape2)
library(ggplot2)

data_long <- melt(data_num)

## No id variables; using all as measure variables

ggplot(data_long, aes(x = variable, y = value)) +
  geom_boxplot(na.rm = TRUE) +
  theme_minimal() +
  labs(x = "Variabel", y = "Nilai")

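A boxplot of every numeric variable visualises the spread and flags potential outliers before further analysis. Because all variables share one y-axis, those with a small range (such as CHE and CHOL) are likely to appear compressed next to CREA and GGT. Note that na.rm = TRUE drops missing values without a warning.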
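
The points a boxplot draws beyond the whiskers are, by default, values more than 1.5 times the interquartile range away from the quartiles. The sketch below applies that rule directly so the flagged values can be listed; flag_outliers and the vector ggt are hypothetical names with made-up values.

```r
# Flag values outside the 1.5 * IQR fences (Tukey's rule)
flag_outliers <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Hypothetical measurements with one extreme value
ggt <- c(10, 12, 11, 13, 12, 100)
ggt[flag_outliers(ggt)]
```

Whether a flagged value is an error or a genuine clinical extreme still has to be judged case by case.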

cor_matrix <- cor(data_num)
cor_matrix

##              Age ALB ALP ALT         AST         BIL         CHE CHOL
## Age   1.00000000  NA  NA  NA  0.08866590  0.03249182 -0.07509348   NA
## ALB           NA   1  NA  NA          NA          NA          NA   NA
## ALP           NA  NA   1  NA          NA          NA          NA   NA
## ALT           NA  NA  NA   1          NA          NA          NA   NA
## AST   0.08866590  NA  NA  NA  1.00000000  0.31223141 -0.20853580   NA
## BIL   0.03249182  NA  NA  NA  0.31223141  1.00000000 -0.33317203   NA
## CHE  -0.07509348  NA  NA  NA -0.20853580 -0.33317203  1.00000000   NA
## CHOL          NA  NA  NA  NA          NA          NA          NA    1
## CREA -0.02229637  NA  NA  NA -0.02138721  0.03122353 -0.01115696   NA
## GGT   0.15308684  NA  NA  NA  0.49126255  0.21702381 -0.11034518   NA
## PROT          NA  NA  NA  NA          NA          NA          NA   NA
##             CREA        GGT PROT
## Age  -0.02229637  0.1530868   NA
## ALB           NA         NA   NA
## ALP           NA         NA   NA
## ALT           NA         NA   NA
## AST  -0.02138721  0.4912625   NA
## BIL   0.03122353  0.2170238   NA
## CHE  -0.01115696 -0.1103452   NA
## CHOL          NA         NA   NA
## CREA  1.00000000  0.1210033   NA
## GGT   0.12100333  1.0000000   NA
## PROT          NA         NA    1

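The correlation matrix measures the linear relationship between each pair of numeric variables and underlies both the adequacy tests and the factor analysis. Every off-diagonal entry involving ALB, ALP, ALT, CHOL, or PROT is NA because these columns contain missing values, and cor() by default returns NA for any pair with incomplete observations. The later steps therefore work on complete cases only (data_num_clean).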
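
The NA entries can be avoided with the use argument of cor(). The sketch below contrasts the default with "complete.obs" (drop every row that has any missing value) and "pairwise.complete.obs" (use all rows available for each pair). The data frame toy is made up for illustration.

```r
# Hypothetical data with one missing value in column x
toy <- data.frame(
  x = c(1, 2, 3, 4, NA),
  y = c(2, 4, 6, 8, 10),
  z = c(5, 3, 4, 1, 2)
)

cor_default <- cor(toy)                                  # NA wherever x is involved
cor_complete <- cor(toy, use = "complete.obs")           # rows 1 to 4 only
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")  # per-pair available rows

cor_default
cor_complete
```

With "complete.obs" every coefficient is based on the same rows, which matches what na.omit() followed by cor() produces.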

# Install psych only if it is not already available
if (!requireNamespace("psych", quietly = TRUE)) install.packages("psych")

library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

KMO(data_num)

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = data_num)
## Overall MSA =  0.63
## MSA for each item = 
##  Age  ALB  ALP  ALT  AST  BIL  CHE CHOL CREA  GGT PROT 
## 0.65 0.66 0.54 0.63 0.56 0.74 0.73 0.65 0.52 0.60 0.60

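The Kaiser-Meyer-Olkin (KMO) test checks whether the data are suitable for factor analysis; values closer to 1 indicate better sampling adequacy. The overall MSA of 0.63 is above the commonly used minimum of 0.5 but falls in the range Kaiser described as mediocre, so the data are adequate rather than ideal. The per-variable MSA ranges from 0.52 (CREA) to 0.74 (BIL).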
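
The overall MSA compares the squared correlations with the squared partial correlations: it approaches 1 when the partial correlations are small relative to the ordinary ones. The base-R sketch below illustrates that definition on simulated data; kmo_overall and toy are hypothetical names, and this is not a substitute for psych::KMO().

```r
# Overall KMO / MSA from a correlation matrix R
kmo_overall <- function(R) {
  R_inv <- solve(R)
  d <- sqrt(diag(R_inv))
  # Partial correlations of each pair given all other variables
  partial <- -R_inv / outer(d, d)
  off <- row(R) != col(R)  # off-diagonal positions only
  sum(R[off]^2) / (sum(R[off]^2) + sum(partial[off]^2))
}

# Simulated data: four variables sharing one common component
set.seed(1)
common <- rnorm(200)
toy <- sapply(1:4, function(i) 0.7 * common + rnorm(200))

kmo_value <- kmo_overall(cor(toy))
kmo_value
```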

library(psych)

data_num_clean <- na.omit(data_num)

cortest.bartlett(cor(data_num_clean), n = nrow(data_num_clean))

## $chisq
## [1] 1182.649
## 
## $p.value
## [1] 2.213186e-211
## 
## $df
## [1] 55

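Bartlett's test of sphericity checks whether the correlation matrix differs from an identity matrix, that is, whether the variables are correlated at all. With a chi-square of 1182.65 on 55 degrees of freedom (p < 0.001), the null hypothesis is rejected, so the correlations are strong enough to justify factor analysis. The test uses data_num_clean, the 589 complete cases that remain after na.omit().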
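
The statistic has a closed form: chi-square = -(n - 1 - (2p + 5) / 6) * log(det(R)) with p(p - 1) / 2 degrees of freedom, where p is the number of variables and n the number of observations. For p = 11 this gives the 55 degrees of freedom reported above. A sketch of the calculation, with bartlett_sphericity as a hypothetical helper name:

```r
# Bartlett's test of sphericity from a correlation matrix R and sample size n
bartlett_sphericity <- function(R, n) {
  p <- ncol(R)
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

# Identity matrix: no correlation at all, so the statistic is 0
res_identity <- bartlett_sphericity(diag(11), n = 589)

# Three variables that all correlate 0.5 with each other
R_equal <- matrix(0.5, nrow = 3, ncol = 3)
diag(R_equal) <- 1
res_equal <- bartlett_sphericity(R_equal, n = 100)
res_equal
```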

data_scaled <- scale(data_num_clean)

The data are standardised so that every variable has mean 0 and standard deviation 1. This matters for PCA in particular: without it, variables measured on large scales (such as CREA, with values up to 1079.1) would dominate the components simply because of their units.
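
A quick check that scale() behaves as described, and that passing scale. = TRUE to prcomp() gives the same component standard deviations as scaling beforehand. The data frame toy is made up for illustration.

```r
# Hypothetical variables on very different scales
toy <- data.frame(
  CREA = c(70, 74, 86, 106, 111),
  CHE = c(6.9, 7.3, 8.8, 9.2, 11.2)
)

toy_scaled <- scale(toy)

# Column means are 0 and standard deviations are 1 (up to rounding error)
col_means <- colMeans(toy_scaled)
col_sds <- apply(toy_scaled, 2, sd)

# Scaling first, or letting prcomp() scale, gives the same result
sdev_manual <- prcomp(toy_scaled)$sdev
sdev_builtin <- prcomp(toy, scale. = TRUE)$sdev
```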

pca <- prcomp(data_scaled)
summary(pca)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.5667 1.3626 1.1632 1.04041 1.01348 0.85661 0.81827
## Proportion of Variance 0.2231 0.1688 0.1230 0.09841 0.09338 0.06671 0.06087
## Cumulative Proportion  0.2231 0.3919 0.5149 0.61332 0.70669 0.77340 0.83427
##                            PC8     PC9    PC10    PC11
## Standard deviation     0.76309 0.70725 0.65039 0.56348
## Proportion of Variance 0.05294 0.04547 0.03846 0.02886
## Cumulative Proportion  0.88721 0.93268 0.97114 1.00000

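PCA reduces dimensionality by finding linear combinations of the variables that capture the most variance. The summary shows how much each principal component explains: PC1 accounts for 22.3%, the first three components for 51.5%, and the first five for 70.7% of the total variance. Five components have a standard deviation above 1, which corresponds to an eigenvalue above 1.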
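
The rows of the summary are all derived from the component standard deviations: the eigenvalue is the squared standard deviation, and the proportion of variance is the eigenvalue divided by the sum of all eigenvalues. The sketch below recomputes them from the rounded standard deviations printed above; the variable names are illustrative, and the eigenvalue-above-1 count is the Kaiser heuristic, not a rule used elsewhere in this module.

```r
# Standard deviations copied from the summary(pca) output (rounded)
sdev <- c(1.5667, 1.3626, 1.1632, 1.04041, 1.01348, 0.85661,
          0.81827, 0.76309, 0.70725, 0.65039, 0.56348)

eigenvalues <- sdev^2                        # variance of each component
prop_var <- eigenvalues / sum(eigenvalues)   # proportion of variance
cum_var <- cumsum(prop_var)                  # cumulative proportion

n_kaiser <- sum(eigenvalues > 1)             # components with eigenvalue > 1
n_70 <- which(cum_var >= 0.70)[1]            # components needed to reach 70%

round(rbind(eigenvalues, prop_var, cum_var), 3)
```

With the fitted object the same quantities come from pca$sdev directly.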

library(psych)

fa.parallel(data_scaled)

## Parallel analysis suggests that the number of factors =  5  and the number of components =  3

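fa.parallel() estimates how many factors or components to retain by comparing the eigenvalues of the observed data with those obtained from random data of the same size. Here it suggests 5 factors and 3 components; the factor analysis below uses 5 factors.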
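
The idea behind parallel analysis can be sketched in base R for the component case: keep the leading components whose eigenvalues exceed what random, uncorrelated data of the same dimensions would produce. This is a simplified illustration under stated assumptions (normal random data, a 95% quantile threshold, components only); parallel_components and toy are hypothetical names, and psych::fa.parallel() does more than this.

```r
# Simplified parallel analysis for principal components
parallel_components <- function(x, n_sim = 100, seed = 1) {
  set.seed(seed)  # note: this resets the global random number stream
  n <- nrow(x)
  p <- ncol(x)
  observed <- eigen(cor(x), only.values = TRUE)$values
  # Eigenvalues of correlation matrices from random normal data (p x n_sim)
  simulated <- replicate(n_sim, {
    noise <- matrix(rnorm(n * p), nrow = n)
    eigen(cor(noise), only.values = TRUE)$values
  })
  threshold <- apply(simulated, 1, quantile, probs = 0.95)
  exceeds <- observed > threshold
  # Count leading components that beat the random threshold
  n_keep <- if (all(exceeds)) p else which(!exceeds)[1] - 1
  list(observed = observed, threshold = threshold, n_keep = n_keep)
}

# Simulated data: six variables driven by a single common component
set.seed(42)
common <- rnorm(300)
toy <- sapply(1:6, function(i) 0.8 * common + 0.6 * rnorm(300))

result <- parallel_components(toy, n_sim = 50)
result$n_keep
```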

library(psych)

fa_result <- fa(data_scaled, nfactors = 5, rotate = "varimax")

print(fa_result)

## Factor Analysis using method =  minres
## Call: fa(r = data_scaled, nfactors = 5, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##        MR2   MR3   MR1   MR4   MR5    h2   u2 com
## Age   0.07 -0.18 -0.06  0.05  0.42 0.222 0.78 1.5
## ALB  -0.15  0.69  0.15 -0.06 -0.14 0.541 0.46 1.3
## ALP   0.12 -0.09  0.18  0.69  0.27 0.605 0.39 1.6
## ALT   0.21 -0.01  0.54  0.13 -0.02 0.350 0.65 1.4
## AST   0.94 -0.06  0.01 -0.05 -0.03 0.898 0.10 1.0
## BIL   0.34 -0.07 -0.36  0.10  0.02 0.257 0.74 2.2
## CHE  -0.21  0.37  0.56 -0.06  0.11 0.517 0.48 2.2
## CHOL -0.17  0.30  0.37 -0.03  0.51 0.522 0.48 2.8
## CREA -0.02  0.00 -0.04  0.26 -0.06 0.074 0.93 1.2
## GGT   0.53 -0.06  0.08  0.49  0.18 0.564 0.44 2.3
## PROT  0.06  0.83  0.03 -0.02 -0.02 0.690 0.31 1.0
## 
##                        MR2  MR3  MR1  MR4  MR5
## SS loadings           1.45 1.44 0.94 0.83 0.59
## Proportion Var        0.13 0.13 0.09 0.08 0.05
## Cumulative Var        0.13 0.26 0.35 0.42 0.48
## Proportion Explained  0.28 0.27 0.18 0.16 0.11
## Cumulative Proportion 0.28 0.55 0.73 0.89 1.00
## 
## Mean item complexity =  1.7
## Test of the hypothesis that 5 factors are sufficient.
## 
## df null model =  55  with the objective function =  2.03 with Chi Square =  1182.65
## df of  the model are 10  and the objective function was  0.03 
## 
## The root mean square of the residuals (RMSR) is  0.01 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic n.obs is  589 with the empirical chi square  6.01  with prob <  0.81 
## The total n.obs was  589  with Likelihood Chi Square =  18.82  with prob <  0.043 
## 
## Tucker Lewis Index of factoring reliability =  0.957
## RMSEA index =  0.039  and the 90 % confidence intervals are  0.007 0.065
## BIC =  -44.96
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    MR2  MR3  MR1  MR4   MR5
## Correlation of (regression) scores with factors   0.95 0.88 0.76 0.79  0.68
## Multiple R square of scores with factors          0.91 0.77 0.57 0.62  0.47
## Minimum correlation of possible factor scores     0.81 0.54 0.15 0.24 -0.07

fa_result$loadings

## 
## Loadings:
##      MR2    MR3    MR1    MR4    MR5   
## Age         -0.177                0.424
## ALB  -0.152  0.686  0.151        -0.143
## ALP   0.118         0.177  0.691  0.273
## ALT   0.212         0.535  0.131       
## AST   0.944                            
## BIL   0.338        -0.357              
## CHE  -0.211  0.373  0.563         0.113
## CHOL -0.172  0.302  0.370         0.513
## CREA                       0.262       
## GGT   0.532                0.490  0.178
## PROT         0.828                     
## 
##                  MR2   MR3   MR1   MR4   MR5
## SS loadings    1.452 1.437 0.936 0.826 0.589
## Proportion Var 0.132 0.131 0.085 0.075 0.054
## Cumulative Var 0.132 0.263 0.348 0.423 0.476

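The factor analysis extracts five factors with varimax rotation, which keeps the factors uncorrelated and makes the loadings easier to interpret. Together the five factors explain 48% of the total variance. Judging by the largest loadings, MR2 is dominated by AST (0.94) and GGT (0.53), MR3 by PROT (0.83) and ALB (0.69), MR1 by CHE (0.56) and ALT (0.54), MR4 by ALP (0.69) and GGT (0.49), and MR5 by CHOL (0.51) and Age (0.42). CREA is barely represented by any factor (communality 0.07). The fit indices (RMSR 0.01, RMSEA 0.039, Tucker-Lewis Index 0.957) suggest that the five-factor model reproduces the observed correlations well, so the correlated clinical variables can be summarised by a small number of latent factors.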
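
The h2 column and the SS loadings row follow directly from the loading matrix: a variable's communality is the sum of its squared loadings across factors, and a factor's SS loading is the sum of squared loadings down its column. The sketch below re-enters the two-decimal loadings printed above, so results match the output only up to rounding; L, h2, ss_loadings, and dominant are illustrative names.

```r
vars <- c("Age", "ALB", "ALP", "ALT", "AST", "BIL",
          "CHE", "CHOL", "CREA", "GGT", "PROT")
factors <- c("MR2", "MR3", "MR1", "MR4", "MR5")

# Rotated loadings copied from print(fa_result), rounded to two decimals
L <- matrix(c(
   0.07, -0.18, -0.06,  0.05,  0.42,
  -0.15,  0.69,  0.15, -0.06, -0.14,
   0.12, -0.09,  0.18,  0.69,  0.27,
   0.21, -0.01,  0.54,  0.13, -0.02,
   0.94, -0.06,  0.01, -0.05, -0.03,
   0.34, -0.07, -0.36,  0.10,  0.02,
  -0.21,  0.37,  0.56, -0.06,  0.11,
  -0.17,  0.30,  0.37, -0.03,  0.51,
  -0.02,  0.00, -0.04,  0.26, -0.06,
   0.53, -0.06,  0.08,  0.49,  0.18,
   0.06,  0.83,  0.03, -0.02, -0.02
), ncol = 5, byrow = TRUE, dimnames = list(vars, factors))

h2 <- rowSums(L^2)            # communality of each variable
ss_loadings <- colSums(L^2)   # variance accounted for by each factor

# Factor with the largest absolute loading for each variable
dominant <- colnames(L)[apply(abs(L), 1, which.max)]
names(dominant) <- rownames(L)

round(h2, 2)
round(ss_loadings, 2)
dominant
```

With the fitted object, unclass(fa_result$loadings) gives the unrounded matrix for the same calculation.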

Ananta Putri Clara (24031554031), Ayu Anggraini (24031554054)