The goal of this paper is to apply one of the dimension reduction methods of unsupervised learning (PCA or MCA). The dataset used is the HCV dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/HCV+data#). The dataset contains features observed for 615 Hepatitis C patients and blood donors.
library(readxl)
library(readr)
library(naniar)
library(stats)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ stringr 1.4.1
## ✔ tidyr 1.2.0 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
library(ggplot2)
library(psych)
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
hcv <- read_csv("~/Desktop/Seto/UL/Project/Dimension Reduction/hcvdat0.csv")
## New names:
## Rows: 615 Columns: 14
## ── Column specification ────────────────────────────────────────
## Delimiter: ","
## chr (2): Category, Sex
## dbl (12): ...1, Age, ALB, ALP, ALT, AST, BIL, CHE, CHOL, CREA, GGT, PROT
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
dim(hcv)
## [1] 615 14
head(hcv, 10)
## # A tibble: 10 × 14
## ...1 Category Age Sex ALB ALP ALT AST BIL CHE CHOL CREA
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0=Blood Do… 32 m 38.5 52.5 7.7 22.1 7.5 6.93 3.23 106
## 2 2 0=Blood Do… 32 m 38.5 70.3 18 24.7 3.9 11.2 4.8 74
## 3 3 0=Blood Do… 32 m 46.9 74.7 36.2 52.6 6.1 8.84 5.2 86
## 4 4 0=Blood Do… 32 m 43.2 52 30.6 22.6 18.9 7.33 4.74 80
## 5 5 0=Blood Do… 32 m 39.2 74.1 32.6 24.8 9.6 9.15 4.32 76
## 6 6 0=Blood Do… 32 m 41.6 43.3 18.5 19.7 12.3 9.92 6.05 111
## 7 7 0=Blood Do… 32 m 46.3 41.3 17.5 17.8 8.5 7.01 4.79 70
## 8 8 0=Blood Do… 32 m 42.2 41.9 35.8 31.1 16.1 5.82 4.6 109
## 9 9 0=Blood Do… 32 m 50.9 65.5 23.2 21.2 6.9 8.69 4.1 83
## 10 10 0=Blood Do… 32 m 42.4 86.3 20.3 20 35.2 5.46 4.45 81
## # … with 2 more variables: GGT <dbl>, PROT <dbl>
There are 14 attributes in the dataset.
All attributes are numerical except for Category and Sex.
PCA or MCA can't be performed on the entire dataset as it stands, because the variables have varying types and scales. Since PCA works best with numerical data, and since this is an unsupervised analysis, let's omit the classification column, Category. We also don't need the patient ID for this analysis. Thus, let's remove the first two columns.
colnames(hcv)
## [1] "...1" "Category" "Age" "Sex" "ALB" "ALP"
## [7] "ALT" "AST" "BIL" "CHE" "CHOL" "CREA"
## [13] "GGT" "PROT"
hcv <- hcv[,-(1:2)]
The variable Sex is categorical. Since we are going to perform PCA, we can't use a categorical variable directly. Therefore, we convert Sex into a numerical variable by recoding it as a dummy variable: 0 for male and 1 for female.
hcv$Sex <- ifelse(hcv$Sex == "m", 0, 1)
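As a quick sanity check of the recoding (a hypothetical check, not part of the original analysis), we can tabulate the new values:
# Should show only 0s (male) and 1s (female), 615 values in total.
table(hcv$Sex)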
Now all variables are numeric. But we have to keep in mind that there might be missing values in the dataset. Let's check.
gg_miss_var(hcv)
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
vis_miss(hcv)
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
As shown in the graphs above, there are 18 missing values in ALP (2.93%), 10 in CHOL (1.63%), and 1 each in ALT, ALB, and PROT (0.16% each).
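As a numeric complement to the plots, naniar's miss_var_summary() tabulates the same counts and percentages:
# Per-variable missingness table (naniar is already loaded above).
miss_var_summary(hcv)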
Let's check the distributions of those variables. If a variable is normally distributed, we can impute the mean; if not, we can impute the median.
ggplot(hcv, aes(x=ALP)) + geom_histogram(bins = 50)
## Warning: Removed 18 rows containing non-finite values (stat_bin).
ggplot(hcv, aes(x=CHOL)) + geom_histogram(bins =40)
## Warning: Removed 10 rows containing non-finite values (stat_bin).
ggplot(hcv, aes(x=PROT)) + geom_histogram(bins =40)
## Warning: Removed 1 rows containing non-finite values (stat_bin).
ggplot(hcv, aes(x=ALT)) + geom_histogram(bins =40)
## Warning: Removed 1 rows containing non-finite values (stat_bin).
options(scipen=100,digits=4)
shapiro.test(hcv$ALP)
##
## Shapiro-Wilk normality test
##
## data: hcv$ALP
## W = 0.77, p-value <0.0000000000000002
shapiro.test(hcv$CHOL)
##
## Shapiro-Wilk normality test
##
## data: hcv$CHOL
## W = 0.99, p-value = 0.0003
shapiro.test(hcv$PROT)
##
## Shapiro-Wilk normality test
##
## data: hcv$PROT
## W = 0.95, p-value = 0.00000000000003
shapiro.test(hcv$ALT)
##
## Shapiro-Wilk normality test
##
## data: hcv$ALT
## W = 0.58, p-value <0.0000000000000002
From the graphs and test results above, we can conclude that all of these distributions are non-normal. Therefore, we will use the median as the measure of location and impute it for the missing values.
hcv[is.na(hcv$ALP),"ALP"] <- median(hcv$ALP,na.rm=TRUE)
hcv[is.na(hcv$CHOL),"CHOL"] <- median(hcv$CHOL,na.rm=TRUE)
hcv[is.na(hcv$ALB),"ALB"] <- median(hcv$ALB,na.rm=TRUE)
hcv[is.na(hcv$PROT),"PROT"] <- median(hcv$PROT,na.rm=TRUE)
hcv[is.na(hcv$ALT),"ALT"] <- median(hcv$ALT,na.rm=TRUE)
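For reference, the same imputation can be written more compactly with dplyr's across(); this is a sketch of an equivalent alternative (idempotent here, since the NAs were just filled):
# Impute the column median wherever a value is missing.
hcv <- hcv %>%
  mutate(across(c(ALP, CHOL, ALB, PROT, ALT),
                ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))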
summary(hcv)
## Age Sex ALB ALP ALT
## Min. :19.0 Min. :0.000 Min. :14.9 Min. : 11.3 Min. : 0.9
## 1st Qu.:39.0 1st Qu.:0.000 1st Qu.:38.8 1st Qu.: 53.0 1st Qu.: 16.4
## Median :47.0 Median :0.000 Median :42.0 Median : 66.2 Median : 23.0
## Mean :47.4 Mean :0.387 Mean :41.6 Mean : 68.2 Mean : 28.4
## 3rd Qu.:54.0 3rd Qu.:1.000 3rd Qu.:45.2 3rd Qu.: 79.3 3rd Qu.: 33.0
## Max. :77.0 Max. :1.000 Max. :82.2 Max. :416.6 Max. :325.3
## AST BIL CHE CHOL
## Min. : 10.6 Min. : 0.8 Min. : 1.42 Min. :1.43
## 1st Qu.: 21.6 1st Qu.: 5.3 1st Qu.: 6.93 1st Qu.:4.62
## Median : 25.9 Median : 7.3 Median : 8.26 Median :5.30
## Mean : 34.8 Mean : 11.4 Mean : 8.20 Mean :5.37
## 3rd Qu.: 32.9 3rd Qu.: 11.2 3rd Qu.: 9.59 3rd Qu.:6.05
## Max. :324.0 Max. :254.0 Max. :16.41 Max. :9.67
## CREA GGT PROT
## Min. : 8.0 Min. : 4.5 Min. :44.8
## 1st Qu.: 67.0 1st Qu.: 15.7 1st Qu.:69.3
## Median : 77.0 Median : 23.3 Median :72.2
## Mean : 81.3 Mean : 39.5 Mean :72.0
## 3rd Qu.: 88.0 3rd Qu.: 40.2 3rd Qu.:75.4
## Max. :1079.1 Max. :650.9 Max. :90.0
vis_miss(hcv)
There are no more NAs in the data, so we can proceed to the pre-analysis.
cor_hcv <- cor(hcv)
corrplot(cor_hcv, type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.5)
Based on the correlation plot, we can observe positive correlations between PROT and ALB, between GGT and AST, and between ALP and GGT.
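To complement the plot numerically, a small helper (hypothetical, not part of the original analysis) can rank the strongest pairs:
# Flatten the correlation matrix, keep each pair once, sort by |r|.
cor_long <- as.data.frame(as.table(cor_hcv))
names(cor_long) <- c("var1", "var2", "r")
cor_long %>%
  filter(as.integer(var1) < as.integer(var2)) %>%
  arrange(desc(abs(r))) %>%
  head(5)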
Before proceeding to the analysis, we should determine whether the dataset has the required characteristics. Data with limited or no correlation between variables are not appropriate for factor analysis. We will use the KMO (Kaiser-Meyer-Olkin) measure and Bartlett's test of sphericity to check whether the dataset is suitable for factor analysis.
options(scipen = 100, digits = 4)
dim(hcv)
## [1] 615 12
KMO(cor_hcv)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor_hcv)
## Overall MSA = 0.61
## MSA for each item =
## Age Sex ALB ALP ALT AST BIL CHE CHOL CREA GGT PROT
## 0.65 0.46 0.65 0.53 0.64 0.59 0.67 0.69 0.63 0.51 0.61 0.58
cortest.bartlett(cor_hcv, n=615)
## $chisq
## [1] 1303
##
## $p.value
## [1] 4.666e-229
##
## $df
## [1] 66
The KMO statistic measures the extent to which the partial correlations between pairs of variables in a factor analysis are smaller than the original correlations, accounting for the influence of all other variables in the analysis.
A KMO value greater than 0.6 is generally considered adequate, meaning that factor analysis is likely to be useful; this happens when the partial correlations are small relative to the original correlations. A KMO value below 0.5 indicates the opposite, and corrective action is needed, such as removing the problematic variables or adding related variables.
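To make this concrete, here is a minimal sketch of how the overall MSA can be computed by hand from the correlation matrix (psych::KMO() handles this, plus the per-item values, internally):
kmo_manual <- function(R) {
  Ri <- solve(R)                          # inverse of the correlation matrix
  # Partial correlations: -Ri[i,j] / sqrt(Ri[i,i] * Ri[j,j])
  P <- -Ri / sqrt(outer(diag(Ri), diag(Ri)))
  diag(R) <- 0; diag(P) <- 0              # exclude diagonals from the sums
  sum(R^2) / (sum(R^2) + sum(P^2))        # overall MSA
}
kmo_manual(cor_hcv)                       # close to the 0.61 reported above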
Based on the results above, our KMO value is 0.61 and Bartlett's test is highly significant, so the data are suitable. Therefore, we can proceed to the PCA.
PCA transforms the data into a lower-dimensional space while preserving as much of the variance as possible. It is useful for evaluating multivariate datasets with multiple correlated numerical features. Usually, the initial components retain the majority of the information in the original data, reducing it from 12 dimensions (the original variables) to just 2 (two components that summarize the 12).
To run PCA, we should make sure that all variables are on the same scale; otherwise, variables with large dispersion would be over-represented. In PCA(), we will set scale.unit = TRUE, which scales the variables to unit variance before the computation.
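A quick look at the raw standard deviations (consistent with the summary above, where CREA and GGT range into the hundreds while CHOL stays below 10) shows why this matters:
# Standard deviations of the raw variables; the largest would dominate
# the first components if the data were left unscaled.
round(sapply(hcv, sd), 1)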
library(FactoMineR)
( hcv_pca <- PCA(hcv, scale.unit = TRUE))
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 615 individuals, described by 12 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
From the PCA graphs above, we can observe how the original variables are correlated among themselves and with the principal components. Now, we will build a plot using the factoextra package to help us understand these relationships. A correlation circle plot shows the relationships among all variables as they are plotted on the first two principal components (Dim1 and Dim2).
Note on the plot: variables pointing in similar directions are positively correlated, arrows at roughly right angles indicate little correlation, and arrows reaching closer to the unit circle are better represented by the first two dimensions.
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
(pca_var_plot <- fviz_pca_var(hcv_pca))
# Sum the variance preserved by the first two components and print the result.
( variance_first_two_pca <- hcv_pca$eig[1, 2] + hcv_pca$eig[2,2] )
## [1] 36.34
The first two principal components represent only about 36% of the variance, so two dimensions do not summarize this dataset well. In the variable correlation plot, and consistent with the correlation matrix above, PROT and ALB point in similar directions, as do GGT, AST, and ALP.
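As a cross-check, base R's prcomp() on the scaled data should report the same cumulative proportion of variance for the first two components:
# PCA via prcomp(); scale. = TRUE matches scale.unit = TRUE in PCA().
hcv_prcomp <- prcomp(hcv, scale. = TRUE)
summary(hcv_prcomp)$importance["Cumulative Proportion", 1:2]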
fviz_pca_ind(hcv_pca, title = "Observation - PCA")
UCI Machine Learning Repository. HCV Data Set. https://archive.ics.uci.edu/ml/datasets/HCV+data
Hayden, Luke. (2018, August). Principal Component Analysis in R Tutorial. Datacamp. https://www.datacamp.com/tutorial/pca-analysis-r
IBM. (2020). Kaiser-Meyer-Olkin measure for identity correlation matrix. https://www.ibm.com/support/pages/kaiser-meyer-olkin-measure-identity-correlation-matrix