1. Introduction

Principal Component Analysis (PCA) is the most common method of data dimensionality reduction. According to Wikipedia, it was first proposed by Karl Pearson (who also invented the chi-square test) in 1901, so it has been in use for more than a hundred years. As a dimensionality reduction method, PCA linearly transforms the original data into components, keeps the components that carry the most information, and discards the less informative ones, thereby reducing redundancy and noise.

The main focus of this paper is to use PCA to reduce the dimensionality of data on patients with cardiovascular disease, to reduce the noise in the data, and to identify the factors affecting cardiovascular disease. The dataset has 462 observations and contains 10 variables, which are explained below:
1. sbp: systolic blood pressure
2. tobacco: cumulative tobacco (kg)
3. ldl: low density lipoprotein cholesterol
4. adiposity
5. famhist: family history of heart disease
6. typea: type-A behavior
7. obesity
8. alcohol: current alcohol consumption
9. age: age at onset
10. chd: coronary heart disease (1 = patient, 0 = healthy)

2. Libraries

library(corrplot)     # correlation heat maps
library(clusterSim)   # data.Normalization() for standardization
library(GGally)       # ggpairs() pairwise plots
library(qgraph)       # network graphs of correlation matrices
library(smacof)       # multidimensional scaling utilities
library(ggbiplot)     # PCA biplots
library(factoextra)   # fviz_*() helpers for PCA visualization
library(gridExtra)    # grid.arrange() for combining plots

3. Data processing

Read the data and check for missing values. From the results below, there are no missing values, so no imputation is required. The categorical variable famhist and the response chd are then dropped, and the remaining numeric variables are standardized.

#read data
data <- read.table("cardiovascular.txt", header = TRUE, sep = ";", row.names = 1)
#checking for missing value
data[!complete.cases(data),]
##  [1] sbp       tobacco   ldl       adiposity famhist   typea     obesity  
##  [8] alcohol   age       chd      
## <0 rows> (or 0-length row.names)
# drop famhist (categorical) and chd (the response) before standardizing
data1 <- data[, c(1:4, 6:9)]
head(data1)
##   sbp tobacco  ldl adiposity typea obesity alcohol age
## 1 160   12.00 5.73     23.11    49   25.30   97.20  52
## 2 144    0.01 4.41     28.61    55   28.87    2.06  63
## 3 118    0.08 3.48     32.28    52   29.14    3.81  46
## 4 170    7.50 6.41     38.03    51   31.99   24.26  58
## 5 134   13.60 3.50     27.78    60   25.99   57.34  49
## 6 132    6.20 6.47     36.21    62   30.77   14.14  45
# Standardization of the data: type = "n1" is z-score standardization,
# i.e. (x - mean) / sd, applied column-wise
data_new <- data.Normalization(data1, type = "n1", normalization = "column")
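As a quick sanity check (a minimal sketch, assuming data1 is loaded as above): type = "n1" in clusterSim is ordinary z-score standardization, so base R's scale() should produce the same values.

# clusterSim's "n1" normalization is the usual z-score,
# so it should match base R's scale() up to attributes
data_scaled <- scale(data1)
all.equal(as.matrix(data_new), data_scaled, check.attributes = FALSE)  # expect TRUE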

4. Relationship between variables

To begin, let's take a look at the data and its basic statistics.

head(data_new,3) 
##          sbp    tobacco        ldl  adiposity      typea    obesity    alcohol
## 1  1.0574173  1.8210988  0.4778941 -0.2951832 -0.4180170 -0.1765945  3.2741887
## 2  0.2767892 -0.7893817 -0.1595071  0.4116942  0.1931344  0.6706459 -0.6120811
## 3 -0.9917313 -0.7741412 -0.6085852  0.8833742 -0.1124413  0.7347229 -0.5405973
##         age
## 1 0.6286543
## 2 1.3816170
## 3 0.2179473
summary(data_new)
##       sbp             tobacco             ldl            adiposity       
##  Min.   :-1.8211   Min.   :-0.7916   Min.   :-1.8158   Min.   :-2.39911  
##  1st Qu.:-0.6990   1st Qu.:-0.7801   1st Qu.:-0.7040   1st Qu.:-0.72381  
##  Median :-0.2111   Median :-0.3561   Median :-0.1933   Median : 0.09103  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.4719   3rd Qu.: 0.4059   3rd Qu.: 0.5069   3rd Qu.: 0.74810  
##  Max.   : 3.8872   Max.   : 6.0014   Max.   : 5.1135   Max.   : 2.19560  
##      typea             obesity            alcohol             age         
##  Min.   :-4.08493   Min.   :-2.69221   Min.   :-0.6962   Min.   :-1.9040  
##  1st Qu.:-0.62173   1st Qu.:-0.72599   1st Qu.:-0.6754   1st Qu.:-0.8088  
##  Median :-0.01058   Median :-0.05675   Median :-0.3895   Median : 0.1495  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.70243   3rd Qu.: 0.58224   3rd Qu.: 0.2797   3rd Qu.: 0.8340  
##  Max.   : 2.53588   Max.   : 4.87362   Max.   : 5.3162   Max.   : 1.4501

For a deeper understanding of the data, pairwise plots of the variables are very illustrative.

data_new_cor <- cor(data_new, method = "pearson")  # correlation matrix, used below
ggpairs(as.data.frame(data_new))                   # pairwise plots of the standardized data

To get a clearer picture of the relationships between the variables, I plotted a heat map of the correlation matrix, in which a strong correlation of adiposity with both age and obesity can be seen.

corrplot(data_new_cor, order ="alphabet", tl.cex=0.6)

I also plotted the correlation matrix as a network graph (the node labels are abbreviated variable names: adp = adiposity, obs = obesity, and so on). It shows a pronounced correlation between age and adiposity and an even stronger one between obesity and adiposity, while the relationships among the remaining variables are harder to read.

qgraph(cor(data_new), shape="rectangle", posCol="darkgreen", negCol="darkmagenta")
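To back up what the plots suggest with numbers, the strongest pairwise correlations can be listed directly (a small sketch; the 0.5 cutoff is an arbitrary choice):

# List variable pairs whose absolute correlation exceeds 0.5
strong <- which(abs(data_new_cor) > 0.5 & upper.tri(data_new_cor), arr.ind = TRUE)
data.frame(var1 = rownames(data_new_cor)[strong[, 1]],
           var2 = colnames(data_new_cor)[strong[, 2]],
           cor  = round(data_new_cor[strong], 2))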

From the above analysis, the correlation structure among the variables is only partially clear, so we perform dimensionality reduction with PCA to reduce the noise in the data.

5. Dimensionality reduction

5.1 PCA

PCA is an unsupervised learning technique that reduces high-dimensional data to a lower-dimensional representation, capturing the essence of the original data by retaining the directions of maximum variation in the dataset (that is, by focusing on the combinations of variables that actually distinguish the observations). Next I will use PCA to reduce the dimensionality of the data and the noise in it.
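For intuition, PCA amounts to an eigendecomposition of the covariance matrix of the data (here the correlation matrix, since the data are standardized); prcomp computes the same thing via the singular value decomposition, which is numerically more stable. A minimal sketch of the equivalence (the sign of each loading vector may differ between the two routes):

# PCA by hand: eigendecomposition of the correlation matrix.
# prcomp() uses the SVD of the data matrix instead; the results
# agree up to the sign of each loading vector.
eig_decomp <- eigen(cor(data_new))
round(eig_decomp$values, 3)          # eigenvalues = variances of the PCs
round(eig_decomp$vectors[, 1:3], 3)  # loadings of the first three PCs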

# data_new is already standardized, so no further centering or scaling is needed
data_new_pca1 <- prcomp(data_new, center = FALSE, scale. = FALSE)
data_new_pca1$rotation[, 1:3]
##                   PC1        PC2         PC3
## sbp       -0.33366847  0.2385342 -0.07063250
## tobacco   -0.30958566  0.4585801  0.09623456
## ldl       -0.33717322 -0.3639169 -0.01441214
## adiposity -0.52775285 -0.1874058 -0.03440724
## typea      0.02424838 -0.2826109  0.83098332
## obesity   -0.41280362 -0.3917102  0.12053210
## alcohol   -0.12033587  0.5428173  0.50417985
## age       -0.46382937  0.1931140 -0.15842917
summary(data_new_pca1)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.6764 1.0941 1.0288 0.9211 0.87295 0.81910 0.69081
## Proportion of Variance 0.3513 0.1496 0.1323 0.1061 0.09526 0.08387 0.05965
## Cumulative Proportion  0.3513 0.5009 0.6332 0.7393 0.83455 0.91842 0.97807
##                            PC8
## Standard deviation     0.41885
## Proportion of Variance 0.02193
## Cumulative Proportion  1.00000

From the above results, PC1 explains 35.13% of the total variance, and the proportions explained by PC2 and PC3 are progressively lower, at only 14.96% and 13.23%. By PC5 the cumulative proportion reaches 83.46%, which is a reasonably good result.

5.2 Optimal number of components

The Kaiser stopping rule is one method for deciding how many components to retain: components with eigenvalues greater than 1 are kept. A related method is the scree test, in which the eigenvalues are plotted on the vertical axis against the component number on the horizontal axis, in decreasing order; the number of components is chosen at the "elbow", the point after which the eigenvalue curve flattens out. A third criterion is the percentage of variance explained: retaining components that together explain 70-90% of the variance is generally considered good.
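These criteria are easy to check numerically from the prcomp output (a short sketch; since the input is standardized, sdev^2 gives the eigenvalues of the correlation matrix):

# Eigenvalues and the Kaiser rule, computed by hand
eig <- data_new_pca1$sdev^2
round(eig, 3)
sum(eig > 1)                      # Kaiser rule: number of components to keep
round(cumsum(eig) / sum(eig), 4)  # cumulative proportion of variance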

fviz_eig(data_new_pca1, choice='eigenvalue')    

According to the Kaiser stopping rule, we should keep the first three components, because only the first three eigenvalues are greater than 1.

fviz_eig(data_new_pca1)

summary(data_new_pca1)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.6764 1.0941 1.0288 0.9211 0.87295 0.81910 0.69081
## Proportion of Variance 0.3513 0.1496 0.1323 0.1061 0.09526 0.08387 0.05965
## Cumulative Proportion  0.3513 0.5009 0.6332 0.7393 0.83455 0.91842 0.97807
##                            PC8
## Standard deviation     0.41885
## Proportion of Variance 0.02193
## Cumulative Proportion  1.00000

According to the above results, the first three components together explain only 63.32% of the variance, which is a rather low cumulative proportion. Since a cumulative proportion between 70% and 90% is generally preferred, we should take PC4 and PC5 into account as well, bringing the cumulative proportion of explained variance to 83.46%.
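Having settled on five components, the dimensionality-reduced dataset is simply the first five columns of the score matrix (a one-line sketch):

# Reduced representation: scores of the 462 observations on the first five PCs
data_reduced <- data_new_pca1$x[, 1:5]
dim(data_reduced)  # 462 x 5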

5.3 Analysis of components

fviz_pca_var(data_new_pca1, col.var = "steelblue")

data_new_pca1$rotation[,1:6]
##                   PC1        PC2         PC3        PC4        PC5         PC6
## sbp       -0.33366847  0.2385342 -0.07063250 -0.2997153 -0.7752970  0.27512636
## tobacco   -0.30958566  0.4585801  0.09623456  0.5848185  0.1123836 -0.25372962
## ldl       -0.33717322 -0.3639169 -0.01441214  0.3039581  0.2253898  0.76966892
## adiposity -0.52775285 -0.1874058 -0.03440724 -0.1579115  0.1242232 -0.20231058
## typea      0.02424838 -0.2826109  0.83098332  0.2655895 -0.3537248 -0.08122072
## obesity   -0.41280362 -0.3917102  0.12053210 -0.3721910  0.1598844 -0.35927308
## alcohol   -0.12033587  0.5428173  0.50417985 -0.4289256  0.4021169  0.27887660
## age       -0.46382937  0.1931140 -0.15842917  0.2403964 -0.0875795 -0.11472064

The variables that dominate each component can be roughly seen from the figure, but we can also plot the contribution of each variable to the components to determine this more precisely.

# Contribution of each variable to the first six principal components
a <- fviz_contrib(data_new_pca1, "var", axes = 1)
b <- fviz_contrib(data_new_pca1, "var", axes = 2)
c <- fviz_contrib(data_new_pca1, "var", axes = 3)
d <- fviz_contrib(data_new_pca1, "var", axes = 4)
e <- fviz_contrib(data_new_pca1, "var", axes = 5)
f <- fviz_contrib(data_new_pca1, "var", axes = 6)
grid.arrange(a, b, c, d, e, f, top = 'Contribution to the Principal Components')

As can be seen from the above figure, the first component mainly comprises adiposity, age, obesity and ldl; the second component mainly comprises alcohol, tobacco, obesity and ldl; the third component mainly comprises typea and alcohol; the fourth component mainly comprises tobacco, alcohol and obesity; the fifth component mainly comprises sbp and alcohol; and the sixth component mainly comprises ldl and obesity.
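The same percentages can be cross-checked by hand: for PCA on standardized data the columns of the rotation matrix have unit length, so a variable's percentage contribution to a component is just its squared loading times 100 (a quick sketch to verify the plots):

# Percentage contributions: squared loadings sum to 1 within each PC,
# so multiplying by 100 gives each variable's contribution in percent
contrib <- 100 * data_new_pca1$rotation^2
round(contrib[, 1:3], 1)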

5.4 Biplot of the PCA results with disease status

The PCA results can also be shown on a biplot with the observations distinguished by disease status (1 = patient, 0 = healthy).

ggbiplot(data_new_pca1, obs.scale=1, var.scale=1, groups=as.factor(data$chd), ellipse=TRUE, circle = TRUE)

As can be seen from the above graph, the diseased group is shifted in the direction of the adiposity, obesity, ldl and age loadings, suggesting that these factors have the greatest impact on coronary heart disease; they are also the main contributors to the first component.
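One way to back this visual impression with a number is to compare the average PC1 score of the two groups (a minimal sketch; recall that adiposity, age, obesity and ldl all load negatively on PC1, so patients should score lower on PC1 on average):

# Mean PC1 score by disease status; chd comes from the original data
tapply(data_new_pca1$x[, 1], data$chd, mean)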

6. Conclusion

In conclusion, the above analysis suggests that the main factors affecting coronary heart disease are adiposity, obesity, ldl and age. In everyday life, people at risk of coronary heart disease should therefore pay attention to controlling their body fat and weight, and should exercise regularly to keep the body healthy.