The goal of this paper is to apply one of the dimension reduction methods of unsupervised learning (PCA or MCA). The dataset used is the HCV dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/HCV+data#). The dataset contains features observed for 615 Hepatitis C patients and blood donors.
library(readxl)
library(readr)
library(naniar)
library(stats)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ stringr 1.4.1
## ✔ tidyr 1.2.0 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
library(ggplot2)
library(psych)
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
hcv <- read_csv("~/Desktop/Seto/UL/Project/Dimension Reduction/hcvdat0.csv")
## New names:
## Rows: 615 Columns: 14
## ── Column specification ────────────────────────────────────────
## Delimiter: ","
## chr (2): Category, Sex
## dbl (12): ...1, Age, ALB, ALP, ALT, AST, BIL, CHE, CHOL, CREA, GGT, PROT
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
dim(hcv)
## [1] 615 14
head(hcv, 10)
## # A tibble: 10 × 14
## ...1 Category Age Sex ALB ALP ALT AST BIL CHE CHOL CREA
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0=Blood Do… 32 m 38.5 52.5 7.7 22.1 7.5 6.93 3.23 106
## 2 2 0=Blood Do… 32 m 38.5 70.3 18 24.7 3.9 11.2 4.8 74
## 3 3 0=Blood Do… 32 m 46.9 74.7 36.2 52.6 6.1 8.84 5.2 86
## 4 4 0=Blood Do… 32 m 43.2 52 30.6 22.6 18.9 7.33 4.74 80
## 5 5 0=Blood Do… 32 m 39.2 74.1 32.6 24.8 9.6 9.15 4.32 76
## 6 6 0=Blood Do… 32 m 41.6 43.3 18.5 19.7 12.3 9.92 6.05 111
## 7 7 0=Blood Do… 32 m 46.3 41.3 17.5 17.8 8.5 7.01 4.79 70
## 8 8 0=Blood Do… 32 m 42.2 41.9 35.8 31.1 16.1 5.82 4.6 109
## 9 9 0=Blood Do… 32 m 50.9 65.5 23.2 21.2 6.9 8.69 4.1 83
## 10 10 0=Blood Do… 32 m 42.4 86.3 20.3 20 35.2 5.46 4.45 81
## # … with 2 more variables: GGT <dbl>, PROT <dbl>
There are 14 attributes in the dataset.
All attributes are numerical except for Category and Sex.
PCA or MCA can't be performed on the entire dataset as it stands, because the variables have varying types and scales. Since PCA works best with numerical data, and since this is an unsupervised analysis, let's omit the classification column, Category. We also don't need the patient ID for this analysis. Thus, let's remove the first two columns.
colnames(hcv)
## [1] "...1" "Category" "Age" "Sex" "ALB" "ALP"
## [7] "ALT" "AST" "BIL" "CHE" "CHOL" "CREA"
## [13] "GGT" "PROT"
hcv <- hcv[,-(1:2)]
The variable Sex is categorical. Since we are going to perform PCA, we can't use a categorical variable directly. Therefore, we convert Sex into a numerical variable by recoding it as a dummy variable: 0 for male and 1 for female.
hcv$Sex <- ifelse(hcv$Sex == "m", 0, 1)
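As a quick sanity check of the recoding (a hypothetical check, not part of the original analysis), we can tabulate the new values:
# Should show only 0s (male) and 1s (female), 615 values in total.
table(hcv$Sex)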
Now all variables are numeric. But we have to keep in mind that there might be missing values in the dataset. Let's check.
gg_miss_var(hcv)
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
vis_miss(hcv)
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
As shown in the graphs above, there are 18 missing values in ALP (2.93%), 10 in CHOL (1.63%), and 1 each in ALT, ALB, and PROT (0.16% each).
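As a numeric complement to the plots, naniar's miss_var_summary() tabulates the same counts and percentages:
# Per-variable missingness table (naniar is already loaded above).
miss_var_summary(hcv)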
Let's check the distributions of those variables. If a variable is normally distributed, we can impute the mean; if not, we can impute the median.
ggplot(hcv, aes(x=ALP)) + geom_histogram(bins = 50)
## Warning: Removed 18 rows containing non-finite values (stat_bin).
ggplot(hcv, aes(x=CHOL)) + geom_histogram(bins =40)
## Warning: Removed 10 rows containing non-finite values (stat_bin).
ggplot(hcv, aes(x=PROT)) + geom_histogram(bins =40)
## Warning: Removed 1 rows containing non-finite values (stat_bin).
ggplot(hcv, aes(x=ALT)) + geom_histogram(bins =40)
## Warning: Removed 1 rows containing non-finite values (stat_bin).
options(scipen=100,digits=4)
shapiro.test(hcv$ALP)
##
## Shapiro-Wilk normality test
##
## data: hcv$ALP
## W = 0.77, p-value <0.0000000000000002
shapiro.test(hcv$CHOL)
##
## Shapiro-Wilk normality test
##
## data: hcv$CHOL
## W = 0.99, p-value = 0.0003
shapiro.test(hcv$PROT)
##
## Shapiro-Wilk normality test
##
## data: hcv$PROT
## W = 0.95, p-value = 0.00000000000003
shapiro.test(hcv$ALT)
##
## Shapiro-Wilk normality test
##
## data: hcv$ALT
## W = 0.58, p-value <0.0000000000000002
From the graphs and test results above, we can conclude that all of these distributions are non-normal. Therefore, we will use the median as the measure of location and impute it for the missing values.
hcv[is.na(hcv$ALP),"ALP"] <- median(hcv$ALP,na.rm=TRUE)
hcv[is.na(hcv$CHOL),"CHOL"] <- median(hcv$CHOL,na.rm=TRUE)
hcv[is.na(hcv$ALB),"ALB"] <- median(hcv$ALB,na.rm=TRUE)
hcv[is.na(hcv$PROT),"PROT"] <- median(hcv$PROT,na.rm=TRUE)
hcv[is.na(hcv$ALT),"ALT"] <- median(hcv$ALT,na.rm=TRUE)
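For reference, the same imputation can be written more compactly with dplyr's across(); this is a sketch of an equivalent alternative (idempotent here, since the NAs were just filled):
# Impute the column median wherever a value is missing.
hcv <- hcv %>%
  mutate(across(c(ALP, CHOL, ALB, PROT, ALT),
                ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))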
summary(hcv)
## Age Sex ALB ALP ALT
## Min. :19.0 Min. :0.000 Min. :14.9 Min. : 11.3 Min. : 0.9
## 1st Qu.:39.0 1st Qu.:0.000 1st Qu.:38.8 1st Qu.: 53.0 1st Qu.: 16.4
## Median :47.0 Median :0.000 Median :42.0 Median : 66.2 Median : 23.0
## Mean :47.4 Mean :0.387 Mean :41.6 Mean : 68.2 Mean : 28.4
## 3rd Qu.:54.0 3rd Qu.:1.000 3rd Qu.:45.2 3rd Qu.: 79.3 3rd Qu.: 33.0
## Max. :77.0 Max. :1.000 Max. :82.2 Max. :416.6 Max. :325.3
## AST BIL CHE CHOL
## Min. : 10.6 Min. : 0.8 Min. : 1.42 Min. :1.43
## 1st Qu.: 21.6 1st Qu.: 5.3 1st Qu.: 6.93 1st Qu.:4.62
## Median : 25.9 Median : 7.3 Median : 8.26 Median :5.30
## Mean : 34.8 Mean : 11.4 Mean : 8.20 Mean :5.37
## 3rd Qu.: 32.9 3rd Qu.: 11.2 3rd Qu.: 9.59 3rd Qu.:6.05
## Max. :324.0 Max. :254.0 Max. :16.41 Max. :9.67
## CREA GGT PROT
## Min. : 8.0 Min. : 4.5 Min. :44.8
## 1st Qu.: 67.0 1st Qu.: 15.7 1st Qu.:69.3
## Median : 77.0 Median : 23.3 Median :72.2
## Mean : 81.3 Mean : 39.5 Mean :72.0
## 3rd Qu.: 88.0 3rd Qu.: 40.2 3rd Qu.:75.4
## Max. :1079.1 Max. :650.9 Max. :90.0
vis_miss(hcv)
There are no more NAs in the data, so we can proceed to the pre-analysis.
cor_hcv <- cor(hcv)
corrplot(cor_hcv, type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.5)
Based on the correlation plot, we can observe positive correlations between PROT and ALB, between GGT and AST, and between ALP and GGT.
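To complement the plot numerically, a small helper (hypothetical, not part of the original analysis) can rank the strongest pairs:
# Flatten the correlation matrix, keep each pair once, sort by |r|.
cor_long <- as.data.frame(as.table(cor_hcv))
names(cor_long) <- c("var1", "var2", "r")
cor_long %>%
  filter(as.integer(var1) < as.integer(var2)) %>%
  arrange(desc(abs(r))) %>%
  head(5)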
Before proceeding to the analysis, we should determine whether the dataset has the required characteristics. Data with limited or no correlation between variables are not appropriate for factor analysis. We will use the KMO (Kaiser-Meyer-Olkin) measure and Bartlett's test of sphericity to check whether the dataset is suitable for factor analysis.
options(scipen = 100, digits = 4)
dim(hcv)
## [1] 615 12
KMO(cor_hcv)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor_hcv)
## Overall MSA = 0.61
## MSA for each item =
## Age Sex ALB ALP ALT AST BIL CHE CHOL CREA GGT PROT
## 0.65 0.46 0.65 0.53 0.64 0.59 0.67 0.69 0.63 0.51 0.61 0.58
cortest.bartlett(cor_hcv, n=615)
## $chisq
## [1] 1303
##
## $p.value
## [1] 4.666e-229
##
## $df
## [1] 66
The KMO statistic measures the extent to which the partial correlations between pairs of variables in a factor analysis are smaller than the original correlations, accounting for the influence of all other variables in the analysis.
A KMO value greater than 0.6 is generally considered adequate, meaning that factor analysis is likely to be useful; this happens when the partial correlations are small relative to the original correlations. A KMO value below 0.5 indicates the opposite, and corrective action is needed, such as removing the problematic variables or adding related variables.
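To make this concrete, here is a minimal sketch of how the overall MSA can be computed by hand from the correlation matrix (psych::KMO() handles this, plus the per-item values, internally):
kmo_manual <- function(R) {
  Ri <- solve(R)                          # inverse of the correlation matrix
  # Partial correlations: -Ri[i,j] / sqrt(Ri[i,i] * Ri[j,j])
  P <- -Ri / sqrt(outer(diag(Ri), diag(Ri)))
  diag(R) <- 0; diag(P) <- 0              # exclude diagonals from the sums
  sum(R^2) / (sum(R^2) + sum(P^2))        # overall MSA
}
kmo_manual(cor_hcv)                       # close to the 0.61 reported above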
Based on the results above, our KMO value is 0.61 and Bartlett's test is highly significant, so the data are suitable. Therefore, we can proceed to the PCA.
PCA transforms the data into a lower-dimensional space while preserving as much of the variance as possible. It is useful for evaluating multivariate datasets with multiple correlated numerical features. Usually, the initial components retain the majority of the information in the original data, reducing it from 12 dimensions (the original variables) to just 2 (two components that summarize the 12).
To run PCA, we should make sure that all variables are on the same scale; otherwise, variables with large dispersion would be over-represented. In PCA(), we will set scale.unit = TRUE, which scales the variables to unit variance before the computation.
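A quick look at the raw standard deviations (consistent with the summary above, where CREA and GGT range into the hundreds while CHOL stays below 10) shows why this matters:
# Standard deviations of the raw variables; the largest would dominate
# the first components if the data were left unscaled.
round(sapply(hcv, sd), 1)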
library(FactoMineR)
( hcv_pca <- PCA(hcv, scale.unit = TRUE))
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 615 individuals, described by 12 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
From the PCA graphs above, we can observe how the original variables are correlated among themselves and with the principal components. Now, we will build a plot using the factoextra package to help us understand these relationships. A correlation circle plot shows the relationships among all variables as they are plotted on the first two principal components (Dim1 and Dim2).
Note on the plot: variables pointing in similar directions are positively correlated, arrows at roughly right angles indicate little correlation, and arrows reaching closer to the unit circle are better represented by the first two dimensions.
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
(pca_var_plot <- fviz_pca_var(hcv_pca))
# Sum the variance preserved by the first two components and print the result.
( variance_first_two_pca <- hcv_pca$eig[1, 2] + hcv_pca$eig[2,2] )
## [1] 36.34
The first two principal components represent only about 36% of the variance, so two dimensions do not summarize this dataset well. In the variable correlation plot, and consistent with the correlation matrix above, PROT and ALB point in similar directions, as do GGT, AST, and ALP.
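As a cross-check, base R's prcomp() on the scaled data should report the same cumulative proportion of variance for the first two components:
# PCA via prcomp(); scale. = TRUE matches scale.unit = TRUE in PCA().
hcv_prcomp <- prcomp(hcv, scale. = TRUE)
summary(hcv_prcomp)$importance["Cumulative Proportion", 1:2]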
fviz_pca_ind(hcv_pca, title = "Observation - PCA")
UCI Machine Learning Repository. HCV Data Set. https://archive.ics.uci.edu/ml/datasets/HCV+data
Hayden, Luke. (2018, August). Principal Component Analysis in R Tutorial. Datacamp. https://www.datacamp.com/tutorial/pca-analysis-r
IBM. (2020). Kaiser-Meyer-Olkin measure for identity correlation matrix. https://www.ibm.com/support/pages/kaiser-meyer-olkin-measure-identity-correlation-matrix