Introduction

The goal of this paper is to apply one of the dimension reduction methods of unsupervised learning (PCA or MCA). The dataset used is the HCV dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/HCV+data#). It contains laboratory and demographic features observed for 615 Hepatitis C patients and blood donors.

Load required libraries

library(readxl)
library(readr)
library(naniar)
library(stats)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ─────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ stringr 1.4.1
## ✔ tidyr   1.2.0     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
library(ggplot2)
library(psych)
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

Load the dataset

hcv <- read_csv("~/Desktop/Seto/UL/Project/Dimension Reduction/hcvdat0.csv")
## New names:
## • `` -> `...1`
## Rows: 615 Columns: 14
## ── Column specification ────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): Category, Sex
## dbl (12): ...1, Age, ALB, ALP, ALT, AST, BIL, CHE, CHOL, CREA, GGT, PROT
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(hcv)
## [1] 615  14
head(hcv, 10)
## # A tibble: 10 × 14
##     ...1 Category      Age Sex     ALB   ALP   ALT   AST   BIL   CHE  CHOL  CREA
##    <dbl> <chr>       <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     1 0=Blood Do…    32 m      38.5  52.5   7.7  22.1   7.5  6.93  3.23   106
##  2     2 0=Blood Do…    32 m      38.5  70.3  18    24.7   3.9 11.2   4.8     74
##  3     3 0=Blood Do…    32 m      46.9  74.7  36.2  52.6   6.1  8.84  5.2     86
##  4     4 0=Blood Do…    32 m      43.2  52    30.6  22.6  18.9  7.33  4.74    80
##  5     5 0=Blood Do…    32 m      39.2  74.1  32.6  24.8   9.6  9.15  4.32    76
##  6     6 0=Blood Do…    32 m      41.6  43.3  18.5  19.7  12.3  9.92  6.05   111
##  7     7 0=Blood Do…    32 m      46.3  41.3  17.5  17.8   8.5  7.01  4.79    70
##  8     8 0=Blood Do…    32 m      42.2  41.9  35.8  31.1  16.1  5.82  4.6    109
##  9     9 0=Blood Do…    32 m      50.9  65.5  23.2  21.2   6.9  8.69  4.1     83
## 10    10 0=Blood Do…    32 m      42.4  86.3  20.3  20    35.2  5.46  4.45    81
## # … with 2 more variables: GGT <dbl>, PROT <dbl>

Data Preparation

There are 14 attributes in the dataset.

  1. X (Patient ID/No.)
  2. Category (diagnosis) (values: ‘0=Blood Donor’, ‘0s=suspect Blood Donor’, ‘1=Hepatitis’, ‘2=Fibrosis’, ‘3=Cirrhosis’)
  3. Age (in years)
  4. Sex (f,m)
  5. ALB (albumin)
  6. ALP (alkaline phosphatase)
  7. ALT (alanine aminotransferase)
  8. AST (aspartate aminotransferase)
  9. BIL (bilirubin)
  10. CHE (cholinesterase)
  11. CHOL (cholesterol)
  12. CREA (creatinine)
  13. GGT (gamma-glutamyl transferase)
  14. PROT (total protein)

All attributes are numerical, except for Category and Sex.

PCA or MCA can’t be performed on the entire dataset as-is, because its variables have mixed types and scales. Since PCA works best with numerical data and we are working in an unsupervised setting, let’s omit the classification column, Category. We also don’t need the patient ID for this analysis. Thus, let’s remove the first two columns.

colnames(hcv)
##  [1] "...1"     "Category" "Age"      "Sex"      "ALB"      "ALP"     
##  [7] "ALT"      "AST"      "BIL"      "CHE"      "CHOL"     "CREA"    
## [13] "GGT"      "PROT"
hcv <- hcv[,-(1:2)]

The variable Sex is categorical. Since we are going to perform PCA, we can’t use a categorical variable directly. Therefore, we recode Sex as a numeric dummy variable: 0 for male and 1 for female.

hcv$Sex <- ifelse(hcv$Sex == "m", 0, 1)

summary(hcv)
##       Age             Sex             ALB             ALP        
##  Min.   :19.00   Min.   :0.000   Min.   :14.90   Min.   : 11.30  
##  1st Qu.:39.00   1st Qu.:0.000   1st Qu.:38.80   1st Qu.: 52.50  
##  Median :47.00   Median :0.000   Median :41.95   Median : 66.20  
##  Mean   :47.41   Mean   :0.387   Mean   :41.62   Mean   : 68.28  
##  3rd Qu.:54.00   3rd Qu.:1.000   3rd Qu.:45.20   3rd Qu.: 80.10  
##  Max.   :77.00   Max.   :1.000   Max.   :82.20   Max.   :416.60  
##                                  NA's   :1       NA's   :18      
##       ALT              AST              BIL             CHE        
##  Min.   :  0.90   Min.   : 10.60   Min.   :  0.8   Min.   : 1.420  
##  1st Qu.: 16.40   1st Qu.: 21.60   1st Qu.:  5.3   1st Qu.: 6.935  
##  Median : 23.00   Median : 25.90   Median :  7.3   Median : 8.260  
##  Mean   : 28.45   Mean   : 34.79   Mean   : 11.4   Mean   : 8.197  
##  3rd Qu.: 33.08   3rd Qu.: 32.90   3rd Qu.: 11.2   3rd Qu.: 9.590  
##  Max.   :325.30   Max.   :324.00   Max.   :254.0   Max.   :16.410  
##  NA's   :1                                                         
##       CHOL            CREA              GGT              PROT      
##  Min.   :1.430   Min.   :   8.00   Min.   :  4.50   Min.   :44.80  
##  1st Qu.:4.610   1st Qu.:  67.00   1st Qu.: 15.70   1st Qu.:69.30  
##  Median :5.300   Median :  77.00   Median : 23.30   Median :72.20  
##  Mean   :5.368   Mean   :  81.29   Mean   : 39.53   Mean   :72.04  
##  3rd Qu.:6.060   3rd Qu.:  88.00   3rd Qu.: 40.20   3rd Qu.:75.40  
##  Max.   :9.670   Max.   :1079.10   Max.   :650.90   Max.   :90.00  
##  NA's   :10                                         NA's   :1

Now all variables are numeric. But we have to keep in mind that there might be missing values in the dataset. Let’s check.

Check missing data

gg_miss_var(hcv)
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.

vis_miss(hcv)
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

As shown in the plots above, there are 18 missing values in ALP (2.93%), 10 missing values in CHOL (1.63%), and 1 missing value each in ALT, ALB, and PROT (0.16% each).
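
These counts can be confirmed with a quick tabulation:

# Count the NA's in each column (matches the percentages quoted above).
colSums(is.na(hcv))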

Let’s check the distributions of those variables. If a variable is normally distributed, we can impute the mean; if not, we impute the median.

par(mfrow=c(2,2))
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
## 
##     describe
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
hist.data.frame(hcv)

options(scipen=100,digits=4)

shapiro.test(hcv$ALP)
## 
##  Shapiro-Wilk normality test
## 
## data:  hcv$ALP
## W = 0.77, p-value <0.0000000000000002
shapiro.test(hcv$CHOL)
## 
##  Shapiro-Wilk normality test
## 
## data:  hcv$CHOL
## W = 0.99, p-value = 0.0003
shapiro.test(hcv$PROT)
## 
##  Shapiro-Wilk normality test
## 
## data:  hcv$PROT
## W = 0.95, p-value = 0.00000000000003
shapiro.test(hcv$ALT)
## 
##  Shapiro-Wilk normality test
## 
## data:  hcv$ALT
## W = 0.58, p-value <0.0000000000000002

From the graphs and test results above, we can conclude that none of these distributions is normal. Therefore, we will use the median as the measure of location and impute it for the missing data.

hcv[is.na(hcv$ALP),"ALP"] <- median(hcv$ALP,na.rm=TRUE) 
hcv[is.na(hcv$CHOL),"CHOL"] <- median(hcv$CHOL,na.rm=TRUE) 
hcv[is.na(hcv$ALB),"ALB"] <- median(hcv$ALB,na.rm=TRUE) 
hcv[is.na(hcv$PROT),"PROT"] <- median(hcv$PROT,na.rm=TRUE) 
hcv[is.na(hcv$ALT),"ALT"] <- median(hcv$ALT,na.rm=TRUE) 
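
For reference, the same median imputation can be written more compactly with dplyr and tidyr (both already loaded); a sketch equivalent to the per-column assignments above:

# Replace the NA's in every numeric column with that column's median
# (same result as the individual assignments above).
hcv <- hcv %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, median(.x, na.rm = TRUE))))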

library(visdat)
summary(hcv)
##       Age            Sex             ALB            ALP             ALT       
##  Min.   :19.0   Min.   :0.000   Min.   :14.9   Min.   : 11.3   Min.   :  0.9  
##  1st Qu.:39.0   1st Qu.:0.000   1st Qu.:38.8   1st Qu.: 53.0   1st Qu.: 16.4  
##  Median :47.0   Median :0.000   Median :42.0   Median : 66.2   Median : 23.0  
##  Mean   :47.4   Mean   :0.387   Mean   :41.6   Mean   : 68.2   Mean   : 28.4  
##  3rd Qu.:54.0   3rd Qu.:1.000   3rd Qu.:45.2   3rd Qu.: 79.3   3rd Qu.: 33.0  
##  Max.   :77.0   Max.   :1.000   Max.   :82.2   Max.   :416.6   Max.   :325.3  
##       AST             BIL             CHE             CHOL     
##  Min.   : 10.6   Min.   :  0.8   Min.   : 1.42   Min.   :1.43  
##  1st Qu.: 21.6   1st Qu.:  5.3   1st Qu.: 6.93   1st Qu.:4.62  
##  Median : 25.9   Median :  7.3   Median : 8.26   Median :5.30  
##  Mean   : 34.8   Mean   : 11.4   Mean   : 8.20   Mean   :5.37  
##  3rd Qu.: 32.9   3rd Qu.: 11.2   3rd Qu.: 9.59   3rd Qu.:6.05  
##  Max.   :324.0   Max.   :254.0   Max.   :16.41   Max.   :9.67  
##       CREA             GGT             PROT     
##  Min.   :   8.0   Min.   :  4.5   Min.   :44.8  
##  1st Qu.:  67.0   1st Qu.: 15.7   1st Qu.:69.3  
##  Median :  77.0   Median : 23.3   Median :72.2  
##  Mean   :  81.3   Mean   : 39.5   Mean   :72.0  
##  3rd Qu.:  88.0   3rd Qu.: 40.2   3rd Qu.:75.4  
##  Max.   :1079.1   Max.   :650.9   Max.   :90.0
vis_miss(hcv)

Now there are no more NAs in the data, and we can proceed to the pre-analysis.

Pre-analysis Evaluation

cor_hcv <- cor(hcv)
corrplot(cor_hcv, type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.5)

Based on the correlation plot, we can observe positive correlations between PROT and ALB, CHOL and CHE, GGT and AST, and GGT and ALP.

Before we proceed to the analysis, we should determine whether the dataset has the required characteristics. Data with limited or no correlation between variables are not appropriate for factor analysis. We will use the KMO (Kaiser-Meyer-Olkin) measure and Bartlett’s test of sphericity to check whether the dataset is suitable for factor analysis.

options(scipen = 100, digits = 4)
dim(hcv)
## [1] 615  12
KMO(cor_hcv)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor_hcv)
## Overall MSA =  0.61
## MSA for each item = 
##  Age  Sex  ALB  ALP  ALT  AST  BIL  CHE CHOL CREA  GGT PROT 
## 0.65 0.46 0.65 0.53 0.64 0.59 0.67 0.69 0.63 0.51 0.61 0.58
cortest.bartlett(cor_hcv, n=615)
## $chisq
## [1] 1303
## 
## $p.value
## [1] 4.666e-229
## 
## $df
## [1] 66

The KMO statistic measures the extent to which the partial correlations between pairs of variables in a factor analysis are smaller than the original correlations, accounting for the influence of all other variables in the analysis.

A KMO value greater than 0.6 is considered acceptable, meaning that factor analysis is likely to be useful; this occurs when the partial correlations are small relative to the original correlations. On the other hand, a KMO value below 0.5 indicates that the correlation structure is too weak for factoring, and corrective action is needed, such as removing the problematic variables or adding related variables.

Based on the results above, our overall KMO value is 0.61, and Bartlett’s test is highly significant (chi-square = 1303, df = 66, p < 0.001), rejecting the hypothesis that the correlation matrix is an identity matrix. Therefore, we can proceed with PCA.

Principal Component Analysis

PCA transforms the data into a lower-dimensional space while preserving as much variance as possible. It is useful for evaluating multivariate datasets with multiple correlated numerical features. Usually, the first few components retain the majority of the information in the original data, reducing, say, the 12 original variables to just a few components that summarize them.

To run PCA, we should make sure that all variables are on the same scale; otherwise, variables with large dispersion would dominate the components. In prcomp(), we set center = TRUE and scale = TRUE, which standardizes the variables to unit variance before the decomposition.

hcv_pca <- prcomp(hcv, center = TRUE, scale = TRUE)

summary(hcv_pca)
## Importance of components:
##                          PC1   PC2   PC3    PC4    PC5    PC6    PC7    PC8
## Standard deviation     1.569 1.378 1.191 1.0698 0.9899 0.9627 0.8647 0.8059
## Proportion of Variance 0.205 0.158 0.118 0.0954 0.0817 0.0772 0.0623 0.0541
## Cumulative Proportion  0.205 0.363 0.482 0.5770 0.6586 0.7359 0.7982 0.8523
##                           PC9   PC10   PC11   PC12
## Standard deviation     0.7662 0.6849 0.6185 0.5780
## Proportion of Variance 0.0489 0.0391 0.0319 0.0278
## Cumulative Proportion  0.9012 0.9403 0.9722 1.0000
eig.val <- get_eigenvalue(hcv_pca)
eig.val
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1      2.4617           20.514                       20.51
## Dim.2      1.8992           15.827                       36.34
## Dim.3      1.4180           11.817                       48.16
## Dim.4      1.1445            9.538                       57.70
## Dim.5      0.9799            8.166                       65.86
## Dim.6      0.9268            7.723                       73.58
## Dim.7      0.7477            6.231                       79.82
## Dim.8      0.6494            5.412                       85.23
## Dim.9      0.5871            4.892                       90.12
## Dim.10     0.4690            3.909                       94.03
## Dim.11     0.3825            3.187                       97.22
## Dim.12     0.3341            2.784                      100.00

We obtain 12 principal components, PC1-PC12, each explaining a percentage of the total variation in the dataset. PC1 explains 20.5% of the total variance, which means that around a fifth of the information in the 12-variable hcv dataset can be summarized by PC1 alone. PC2 explains a further 15.8% of the total variance, and so on.
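
As a sanity check, these eigenvalues are just the squared component standard deviations, and PCA on standardized data is equivalent to an eigendecomposition of the correlation matrix; a minimal sketch, using hcv_pca and hcv from above:

# Cumulative share of variance from the squared component standard
# deviations; this reproduces the cumulative proportions reported above.
round(cumsum(hcv_pca$sdev^2) / sum(hcv_pca$sdev^2), 3)

# Equivalently, the same eigenvalues come straight from the correlation matrix.
round(eigen(cor(hcv))$values, 4)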

fviz_eig(hcv_pca, choice = "eigenvalue", ncp = 22, barfill = "blue", barcolor = "black", linecolor = "black", addlabels = TRUE, main = "Eigen Values")

Based on Kaiser’s rule, 4 components should be retained, as their eigenvalues are above 1 (PC1-PC4).
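
The same cut-off can be applied programmatically; a small sketch using the prcomp object from above:

# Kaiser's rule: retain components whose eigenvalue (variance) exceeds 1.
eig_vals <- hcv_pca$sdev^2
which(eig_vals > 1)  # components 1 to 4, per the eigenvalue table above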

fviz_screeplot(hcv_pca, addlabels = TRUE, barfill = "blue", barcolor = "blue", linecolor = "maroon")

PC1 alone explains only 20.5% of the total variance; recalling the eigenvalues above, we can use the first 4 principal components to explain 57.7% of the total variance.

According to Abdi and Williams (2010), the relationship between a variable and a principal component (PC) is depicted by the correlation between them, which serves as the coordinates of the variable on the PC. The way variables are represented is distinct from how observations are represented. While observations are shown through their projections, variables are portrayed through their correlations.
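
To make this concrete, the variable coordinates can be computed directly as correlations between the standardized variables and the component scores; a minimal sketch, using hcv and hcv_pca from above:

# A variable's coordinate on a PC is its correlation with that PC's scores
# (Abdi and Williams 2010); these are the arrows drawn by fviz_pca_var() below.
var_coord <- cor(scale(hcv), hcv_pca$x)
round(var_coord[, 1:2], 2)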

fviz_pca_var(hcv_pca, col.var = "contrib")

fviz_pca_ind(hcv_pca, col.ind="cos2", geom = "point", gradient.cols = c("#99FF33", "#CC0066", "black" ))

library("pdp")
## 
## Attaching package: 'pdp'
## The following object is masked from 'package:purrr':
## 
##     partial
pca_var <- get_pca_var(hcv_pca)

fviz_contrib(hcv_pca, "var", axes = 1:6, fill = "tomato3", color = "tomato4")

fviz_contrib(hcv_pca, choice = "var", axes = 1:2, top = 10)
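
For reference, the numbers behind these bar plots can be read off the get_pca_var() result computed earlier:

# Percentage contribution of each variable to PC1 and PC2: the squared
# coordinate on a component relative to that component's total.
round(pca_var$contrib[, 1:2], 2)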

Conclusion

In conclusion, the purpose of this paper was to explore the use of Principal Component Analysis (PCA) for dimension reduction on the hepatitis C virus (HCV) dataset. The results indicate that the dataset can be effectively described by a smaller number of variables, with the latent structure captured by 4 components. These findings suggest that PCA is a valuable tool for reducing the complexity of large datasets and can provide insight into the underlying structure of the HCV data. The results of this study could serve as a starting point for future research into HCV and for applying PCA-based dimension reduction in other areas.

Bibliography

Abdi, Hervé, and Lynne J. Williams. 2010. “Principal Component Analysis.” John Wiley and Sons, Inc. WIREs Comp Stat 2: 433–59. http://staff.ustc.edu.cn/~zwp/teach/MVA/abdi-awPCA2010.pdf.

UCI Machine Learning Repository. (2020). HCV Data Set. https://archive.ics.uci.edu/ml/datasets/HCV+data

Hayden, Luke. (2018, August). Principal Component Analysis in R Tutorial. Datacamp. https://www.datacamp.com/tutorial/pca-analysis-r


IBM. (2020). Kaiser-Meyer-Olkin measure for identity correlation matrix. https://www.ibm.com/support/pages/kaiser-meyer-olkin-measure-identity-correlation-matrix