The goal of this paper is to apply a dimension-reduction method from unsupervised learning (PCA or MCA) to the HCV dataset, which was taken from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/HCV+data#). The dataset contains laboratory and demographic features observed for 615 Hepatitis C patients and blood donors.
library(readxl)
library(readr)
library(naniar)
library(stats)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ stringr 1.4.1
## ✔ tidyr 1.2.0 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
library(ggplot2)
library(psych)
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
hcv <- read_csv("~/Desktop/Seto/UL/Project/Dimension Reduction/hcvdat0.csv")
## New names:
## • `` -> `...1`
## Rows: 615 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): Category, Sex
## dbl (12): ...1, Age, ALB, ALP, ALT, AST, BIL, CHE, CHOL, CREA, GGT, PROT
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(hcv)
## [1] 615 14
head(hcv, 10)
## # A tibble: 10 × 14
## ...1 Category Age Sex ALB ALP ALT AST BIL CHE CHOL CREA
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0=Blood Do… 32 m 38.5 52.5 7.7 22.1 7.5 6.93 3.23 106
## 2 2 0=Blood Do… 32 m 38.5 70.3 18 24.7 3.9 11.2 4.8 74
## 3 3 0=Blood Do… 32 m 46.9 74.7 36.2 52.6 6.1 8.84 5.2 86
## 4 4 0=Blood Do… 32 m 43.2 52 30.6 22.6 18.9 7.33 4.74 80
## 5 5 0=Blood Do… 32 m 39.2 74.1 32.6 24.8 9.6 9.15 4.32 76
## 6 6 0=Blood Do… 32 m 41.6 43.3 18.5 19.7 12.3 9.92 6.05 111
## 7 7 0=Blood Do… 32 m 46.3 41.3 17.5 17.8 8.5 7.01 4.79 70
## 8 8 0=Blood Do… 32 m 42.2 41.9 35.8 31.1 16.1 5.82 4.6 109
## 9 9 0=Blood Do… 32 m 50.9 65.5 23.2 21.2 6.9 8.69 4.1 83
## 10 10 0=Blood Do… 32 m 42.4 86.3 20.3 20 35.2 5.46 4.45 81
## # … with 2 more variables: GGT <dbl>, PROT <dbl>
There are 14 attributes in the dataset.
All attributes are numerical except Category and Sex.
PCA or MCA cannot be performed on the entire dataset as it stands because the variables have varying types and scales. Since PCA works best with numerical data and we are working in an unsupervised setting, we omit the classification column, Category. We also do not need the patient ID for this analysis. Thus, let's remove the first two columns.
colnames(hcv)
## [1] "...1" "Category" "Age" "Sex" "ALB" "ALP"
## [7] "ALT" "AST" "BIL" "CHE" "CHOL" "CREA"
## [13] "GGT" "PROT"
hcv <- hcv[,-(1:2)]
The variable Sex is categorical. Since PCA cannot handle categorical variables directly, we recode Sex as a dummy variable, with 0 for male and 1 for female.
hcv$Sex <- ifelse(hcv$Sex == "m", 0, 1)
summary(hcv)
## Age Sex ALB ALP
## Min. :19.00 Min. :0.000 Min. :14.90 Min. : 11.30
## 1st Qu.:39.00 1st Qu.:0.000 1st Qu.:38.80 1st Qu.: 52.50
## Median :47.00 Median :0.000 Median :41.95 Median : 66.20
## Mean :47.41 Mean :0.387 Mean :41.62 Mean : 68.28
## 3rd Qu.:54.00 3rd Qu.:1.000 3rd Qu.:45.20 3rd Qu.: 80.10
## Max. :77.00 Max. :1.000 Max. :82.20 Max. :416.60
## NA's :1 NA's :18
## ALT AST BIL CHE
## Min. : 0.90 Min. : 10.60 Min. : 0.8 Min. : 1.420
## 1st Qu.: 16.40 1st Qu.: 21.60 1st Qu.: 5.3 1st Qu.: 6.935
## Median : 23.00 Median : 25.90 Median : 7.3 Median : 8.260
## Mean : 28.45 Mean : 34.79 Mean : 11.4 Mean : 8.197
## 3rd Qu.: 33.08 3rd Qu.: 32.90 3rd Qu.: 11.2 3rd Qu.: 9.590
## Max. :325.30 Max. :324.00 Max. :254.0 Max. :16.410
## NA's :1
## CHOL CREA GGT PROT
## Min. :1.430 Min. : 8.00 Min. : 4.50 Min. :44.80
## 1st Qu.:4.610 1st Qu.: 67.00 1st Qu.: 15.70 1st Qu.:69.30
## Median :5.300 Median : 77.00 Median : 23.30 Median :72.20
## Mean :5.368 Mean : 81.29 Mean : 39.53 Mean :72.04
## 3rd Qu.:6.060 3rd Qu.: 88.00 3rd Qu.: 40.20 3rd Qu.:75.40
## Max. :9.670 Max. :1079.10 Max. :650.90 Max. :90.00
## NA's :10 NA's :1
Now all variables are numeric, but we have to keep in mind that there might be missing values in the dataset. Let's check.
gg_miss_var(hcv)
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
vis_miss(hcv)
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
As shown in the plots above, ALP has 18 missing values (2.93%), CHOL has 10 (1.63%), and ALB, ALT, and PROT each have 1 (0.16%).
Let's check the distributions of those variables. If a variable is normally distributed, we can impute the mean; if not, we impute the median.
par(mfrow=c(2,2))
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
hist.data.frame(hcv)
options(scipen=100,digits=4)
shapiro.test(hcv$ALP)
##
## Shapiro-Wilk normality test
##
## data: hcv$ALP
## W = 0.77, p-value <0.0000000000000002
shapiro.test(hcv$CHOL)
##
## Shapiro-Wilk normality test
##
## data: hcv$CHOL
## W = 0.99, p-value = 0.0003
shapiro.test(hcv$PROT)
##
## Shapiro-Wilk normality test
##
## data: hcv$PROT
## W = 0.95, p-value = 0.00000000000003
shapiro.test(hcv$ALT)
##
## Shapiro-Wilk normality test
##
## data: hcv$ALT
## W = 0.58, p-value <0.0000000000000002
From the histograms and test results above, we can conclude that all of these distributions are non-normal. Therefore, we use the median as the measure of location and impute it for the missing values.
hcv[is.na(hcv$ALP),"ALP"] <- median(hcv$ALP,na.rm=TRUE)
hcv[is.na(hcv$CHOL),"CHOL"] <- median(hcv$CHOL,na.rm=TRUE)
hcv[is.na(hcv$ALB),"ALB"] <- median(hcv$ALB,na.rm=TRUE)
hcv[is.na(hcv$PROT),"PROT"] <- median(hcv$PROT,na.rm=TRUE)
hcv[is.na(hcv$ALT),"ALT"] <- median(hcv$ALT,na.rm=TRUE)
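As an aside, the five assignments above can be written more compactly with dplyr; the sketch below is an equivalent formulation (it assumes the same hcv tibble and produces the same result):
# Equivalent median imputation using dplyr::across() (a sketch; same result
# as the per-column assignments above).
hcv <- hcv %>%
  mutate(across(c(ALP, CHOL, ALB, PROT, ALT),
                ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))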
library(visdat)
summary(hcv)
## Age Sex ALB ALP ALT
## Min. :19.0 Min. :0.000 Min. :14.9 Min. : 11.3 Min. : 0.9
## 1st Qu.:39.0 1st Qu.:0.000 1st Qu.:38.8 1st Qu.: 53.0 1st Qu.: 16.4
## Median :47.0 Median :0.000 Median :42.0 Median : 66.2 Median : 23.0
## Mean :47.4 Mean :0.387 Mean :41.6 Mean : 68.2 Mean : 28.4
## 3rd Qu.:54.0 3rd Qu.:1.000 3rd Qu.:45.2 3rd Qu.: 79.3 3rd Qu.: 33.0
## Max. :77.0 Max. :1.000 Max. :82.2 Max. :416.6 Max. :325.3
## AST BIL CHE CHOL
## Min. : 10.6 Min. : 0.8 Min. : 1.42 Min. :1.43
## 1st Qu.: 21.6 1st Qu.: 5.3 1st Qu.: 6.93 1st Qu.:4.62
## Median : 25.9 Median : 7.3 Median : 8.26 Median :5.30
## Mean : 34.8 Mean : 11.4 Mean : 8.20 Mean :5.37
## 3rd Qu.: 32.9 3rd Qu.: 11.2 3rd Qu.: 9.59 3rd Qu.:6.05
## Max. :324.0 Max. :254.0 Max. :16.41 Max. :9.67
## CREA GGT PROT
## Min. : 8.0 Min. : 4.5 Min. :44.8
## 1st Qu.: 67.0 1st Qu.: 15.7 1st Qu.:69.3
## Median : 77.0 Median : 23.3 Median :72.2
## Mean : 81.3 Mean : 39.5 Mean :72.0
## 3rd Qu.: 88.0 3rd Qu.: 40.2 3rd Qu.:75.4
## Max. :1079.1 Max. :650.9 Max. :90.0
vis_miss(hcv)
There are no more NAs in the data, so we can proceed to the pre-analysis.
cor_hcv <- cor(hcv)
corrplot(cor_hcv, type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.5)
Based on the correlation plot, we can observe positive correlations between PROT and ALB, CHOL and CHE, GGT and AST, and ALP and GGT.
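To back this up numerically, we can list the most strongly correlated variable pairs directly from the correlation matrix (a small sketch, assuming the cor_hcv matrix computed above):
# Flatten the correlation matrix, keep each pair once, and show the
# strongest positive correlations.
cor_long <- as.data.frame(as.table(cor_hcv))
cor_long <- subset(cor_long, as.character(Var1) < as.character(Var2))
head(cor_long[order(-cor_long$Freq), ], 5)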
Before we proceed to the analysis, we should determine whether the dataset has the required characteristics. Data with little or no correlation between variables are not appropriate for factor analysis. We will use the KMO (Kaiser-Meyer-Olkin) measure and Bartlett's test of sphericity to check whether the dataset is suitable for factor analysis.
options(scipen = 100, digits = 4)
dim(hcv)
## [1] 615 12
KMO(cor_hcv)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor_hcv)
## Overall MSA = 0.61
## MSA for each item =
## Age Sex ALB ALP ALT AST BIL CHE CHOL CREA GGT PROT
## 0.65 0.46 0.65 0.53 0.64 0.59 0.67 0.69 0.63 0.51 0.61 0.58
cortest.bartlett(cor_hcv, n=615)
## $chisq
## [1] 1303
##
## $p.value
## [1] 4.666e-229
##
## $df
## [1] 66
The KMO statistic measures the extent to which the partial correlations between pairs of variables are small relative to the original correlations, after accounting for the influence of all other variables in the analysis.
A KMO value above 0.6 is generally considered acceptable, meaning that factor analysis is likely to be useful. A value below 0.5 indicates that the variables share too little common variance for factor analysis, and corrective action is needed, such as removing the problematic variables or adding related ones.
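The overall KMO statistic can also be computed by hand from the correlation matrix. The sketch below follows the usual anti-image formulation (partial correlations derived from the inverse of the correlation matrix) and should reproduce the value reported by KMO() up to rounding:
# Manual overall KMO (a sketch): compare squared correlations with squared
# partial correlations over the off-diagonal elements of cor_hcv.
R_inv <- solve(cor_hcv)
partials <- -R_inv / sqrt(outer(diag(R_inv), diag(R_inv)))  # partial correlations
off <- upper.tri(cor_hcv)
sum(cor_hcv[off]^2) / (sum(cor_hcv[off]^2) + sum(partials[off]^2))  # overall MSA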
Based on the results shown above, our overall KMO value is 0.61 and Bartlett's test is highly significant (p < 0.001), indicating that the correlation matrix is not an identity matrix. Therefore, we can proceed with PCA.
PCA transforms the data into a lower-dimensional space while preserving as much of the variance as possible. It is useful for exploring multivariate datasets with many correlated numerical features. Typically, the first few components retain the majority of the information in the original data, reducing the hcv dataset from its 12 original variables to a handful of components that summarize them.
To run PCA, we should make sure that all variables are on comparable scales; otherwise, variables with large dispersion would dominate the components. In prcomp(), we set center = TRUE and scale = TRUE so that the variables are standardized to unit variance before the decomposition.
hcv_pca <- prcomp(hcv, center = TRUE, scale = TRUE)
summary(hcv_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.569 1.378 1.191 1.0698 0.9899 0.9627 0.8647 0.8059
## Proportion of Variance 0.205 0.158 0.118 0.0954 0.0817 0.0772 0.0623 0.0541
## Cumulative Proportion 0.205 0.363 0.482 0.5770 0.6586 0.7359 0.7982 0.8523
## PC9 PC10 PC11 PC12
## Standard deviation 0.7662 0.6849 0.6185 0.5780
## Proportion of Variance 0.0489 0.0391 0.0319 0.0278
## Cumulative Proportion 0.9012 0.9403 0.9722 1.0000
eig.val <- get_eigenvalue(hcv_pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.4617 20.514 20.51
## Dim.2 1.8992 15.827 36.34
## Dim.3 1.4180 11.817 48.16
## Dim.4 1.1445 9.538 57.70
## Dim.5 0.9799 8.166 65.86
## Dim.6 0.9268 7.723 73.58
## Dim.7 0.7477 6.231 79.82
## Dim.8 0.6494 5.412 85.23
## Dim.9 0.5871 4.892 90.12
## Dim.10 0.4690 3.909 94.03
## Dim.11 0.3825 3.187 97.22
## Dim.12 0.3341 2.784 100.00
We obtain 12 principal components, PC1 through PC12, each explaining a share of the total variation in the dataset. PC1 explains 20.5% of the total variance, meaning that roughly a fifth of the information in the 12-variable hcv dataset can be summarized by PC1 alone; PC2 explains a further 15.8%, and so on.
fviz_eig(hcv_pca, choice = "eigenvalue", ncp = 22, barfill = "blue", barcolor = "black", linecolor = "black", addlabels = TRUE, main = "Eigen Values")
Based on Kaiser's rule, 4 components should be retained, as their eigenvalues are above 1 (PC1-PC4).
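As a quick check of this rule, we can count the eigenvalues above 1 directly (a one-liner using the eig.val object computed above):
# Number of components retained under Kaiser's rule (eigenvalue > 1).
sum(eig.val$eigenvalue > 1)
## [1] 4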
fviz_screeplot(hcv_pca, addlabels = TRUE, barfill = "blue", barcolor = "blue", linecolor = "maroon")
PC1 alone explains only 20.5% of the total variance; recalling the eigenvalues above, we can use the first 4 principal components to explain 57.7% of the total variance.
According to Abdi and Williams (2010), the relationship between a variable and a principal component (PC) is depicted by the correlation between them, which serves as the coordinates of the variable on the PC. The way variables are represented is distinct from how observations are represented. While observations are shown through their projections, variables are portrayed through their correlations.
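We can verify this relationship directly on our fit: for a scaled PCA, the variable coordinates stored by factoextra should equal the correlations between the original variables and the component scores (a minimal sketch, assuming the hcv_pca object from above):
# Correlations between the variables and the first two PCs, compared with
# factoextra's stored variable coordinates; differences should be ~0.
manual_cor <- cor(hcv, hcv_pca$x[, 1:2])
round(manual_cor - get_pca_var(hcv_pca)$cor[, 1:2], 3)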
fviz_pca_var(hcv_pca, col.var = "contrib")
fviz_pca_ind(hcv_pca, col.ind="cos2", geom = "point", gradient.cols = c("#99FF33", "#CC0066", "black" ))
library("pdp")
##
## Attaching package: 'pdp'
## The following object is masked from 'package:purrr':
##
## partial
pca_var <- get_pca_var(hcv_pca)
fviz_contrib(hcv_pca, "var", axes = 1:6, fill = "tomato3", color = "tomato4")
fviz_contrib(hcv_pca, choice = "var", axes = 1:2, top = 10)
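For reference, the percentages shown by fviz_contrib() for a single axis can be reproduced from the squared coordinates (a sketch using the pca_var object defined above):
# Contribution of each variable to PC1: its cos2 on PC1 as a share of the
# component's total cos2, expressed in percent.
contrib_pc1 <- 100 * pca_var$cos2[, 1] / sum(pca_var$cos2[, 1])
round(sort(contrib_pc1, decreasing = TRUE), 1)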
In conclusion, the purpose of this research was to explore the potential of using Principal Component Analysis (PCA) for dimension reduction on the hepatitis C virus (HCV) dataset. The results indicate that the dataset can be effectively described by a smaller number of variables, with latent concepts being represented by 4 dimensions. These findings suggest that PCA is a valuable tool for reducing the complexity of large datasets and may provide valuable insights into the underlying structure of HCV data. The results of this study could serve as a starting point for future research into HCV and the use of PCA for dimension reduction in other areas.
Abdi, Hervé, and Lynne J. Williams. 2010. “Principal Component Analysis.” WIREs Computational Statistics 2: 433–59. http://staff.ustc.edu.cn/~zwp/teach/MVA/abdi-awPCA2010.pdf.
Hayden, Luke. 2018. “Principal Component Analysis in R Tutorial.” DataCamp. https://www.datacamp.com/tutorial/pca-analysis-r.
IBM. 2020. “Kaiser-Meyer-Olkin Measure for Identity Correlation Matrix.” https://www.ibm.com/support/pages/kaiser-meyer-olkin-measure-identity-correlation-matrix.
Patrício, M., J. Pereira, J. Crisóstomo, et al. 2018. “Using Resistin, Glucose, Age and BMI to Predict the Presence of Breast Cancer.” BMC Cancer 18: 29. https://doi.org/10.1186/s12885-017-3877-1.
UCI Machine Learning Repository. 2018. “Breast Cancer Coimbra Data Set.” https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Coimbra.