Principal Component Analysis (PCA) is the most common method of data dimensionality reduction. According to Wikipedia, it was first proposed by Karl Pearson (who also invented the chi-square test) in 1901, and has been around for more than a hundred years now. As a method of dimensionality reduction, PCA reduces redundancy and noise by linearly transforming the original data and identifying the major components of the data that have the highest information content and removing the less informative components.
The main focus of this paper is to use PCA method to downscale the
data on patients with cardiovascular diseases, to reduce the noise in
the data and to identify the factors affecting cardiovascular diseases.
The dataset of this paper has 462 observations and contains 9 variables
which are explained below:
1.sbp: systolic blood pressure
2.tobacco: cumulative tobacco (kg)
3.ldl: low densiity lipoprotein cholesterol
4.adiposity
5.typea: type-A behavior
6.obesity
7.alcohol: current alcohol consumption
8.age: age at onset
9.chd: coronary heart disease
library(corrplot)
library(clusterSim)
library(GGally)
library(qgraph)
library(smacof)
library(ggbiplot)
library(factoextra)
library(gridExtra)
Read the data and check for missing values. From the results below, there are no missing values and no data processing is required. Then normalize the data.
#read data
# setwd("E:/Master of Data science/6-unsupervised learing/7-无监督学习project/archive")
data<-read.table("cardiovascular.txt",head=T,sep = ";",row.names = 1)
#checking for missing value
data[!complete.cases(data),]
## [1] sbp tobacco ldl adiposity famhist typea obesity
## [8] alcohol age chd
## <0 行> (或0-长度的row.names)
data1<-data[,c(1:4,6:9)]
head(data1)
## sbp tobacco ldl adiposity typea obesity alcohol age
## 1 160 12.00 5.73 23.11 49 25.30 97.20 52
## 2 144 0.01 4.41 28.61 55 28.87 2.06 63
## 3 118 0.08 3.48 32.28 52 29.14 3.81 46
## 4 170 7.50 6.41 38.03 51 31.99 24.26 58
## 5 134 13.60 3.50 27.78 60 25.99 57.34 49
## 6 132 6.20 6.47 36.21 62 30.77 14.14 45
#Standardization of data
data_new<-data.Normalization(data1, type="n1",normalization="column")
At the beggining lets get look into data, to see basic statistics.
head(data_new,3)
## sbp tobacco ldl adiposity typea obesity alcohol
## 1 1.0574173 1.8210988 0.4778941 -0.2951832 -0.4180170 -0.1765945 3.2741887
## 2 0.2767892 -0.7893817 -0.1595071 0.4116942 0.1931344 0.6706459 -0.6120811
## 3 -0.9917313 -0.7741412 -0.6085852 0.8833742 -0.1124413 0.7347229 -0.5405973
## age
## 1 0.6286543
## 2 1.3816170
## 3 0.2179473
summary(data_new)
## sbp tobacco ldl adiposity
## Min. :-1.8211 Min. :-0.7916 Min. :-1.8158 Min. :-2.39911
## 1st Qu.:-0.6990 1st Qu.:-0.7801 1st Qu.:-0.7040 1st Qu.:-0.72381
## Median :-0.2111 Median :-0.3561 Median :-0.1933 Median : 0.09103
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.4719 3rd Qu.: 0.4059 3rd Qu.: 0.5069 3rd Qu.: 0.74810
## Max. : 3.8872 Max. : 6.0014 Max. : 5.1135 Max. : 2.19560
## typea obesity alcohol age
## Min. :-4.08493 Min. :-2.69221 Min. :-0.6962 Min. :-1.9040
## 1st Qu.:-0.62173 1st Qu.:-0.72599 1st Qu.:-0.6754 1st Qu.:-0.8088
## Median :-0.01058 Median :-0.05675 Median :-0.3895 Median : 0.1495
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.70243 3rd Qu.: 0.58224 3rd Qu.: 0.2797 3rd Qu.: 0.8340
## Max. : 2.53588 Max. : 4.87362 Max. : 5.3162 Max. : 1.4501
For a deeper understanding of the data, it is very visual to include graphs of paired data variables.
data_new_cor<-cor(data_new, method="pearson")
ggpairs(as.data.frame(data_new_cor))
In order to get a clearer picture of the relationship between the variables, I plotted a heat map of the correlation matrix between the variables, in which I can see a strong correlation between adiposity and age and obesity.
corrplot(data_new_cor, order ="alphabet", tl.cex=0.6)
I plotted the relationship of the correlation matrix, in which we can see that there is a more pronounced correlation between age and adp, a stronger relationship between obs and adp, but a more confusing relationship between the other data.
qgraph(cor(data_new), shape="rectangle", posCol="darkgreen", negCol="darkmagenta")
From the above analysis, we can see that the correlation between the variables is not very clear, so we need to perform dimensionality reduction through PCA to reduce the noise in the data.
PCA is an unsupervised learning technique in machine learning that reduces the dimensionality of a highly ordered matrix to a lowly ordered matrix, capturing the essence of the original data by showing the maximum variation in the dataset (by focusing on the variables that actually distinguish the data). Next I will use PCA to downscale multidimensional data and reduce the noise in the data.
data_new_pca1<-prcomp(data_new, center=FALSE, scale.=FALSE)
data_new_pca1$rotation[, 1:3]
## PC1 PC2 PC3
## sbp -0.33366847 0.2385342 -0.07063250
## tobacco -0.30958566 0.4585801 0.09623456
## ldl -0.33717322 -0.3639169 -0.01441214
## adiposity -0.52775285 -0.1874058 -0.03440724
## typea 0.02424838 -0.2826109 0.83098332
## obesity -0.41280362 -0.3917102 0.12053210
## alcohol -0.12033587 0.5428173 0.50417985
## age -0.46382937 0.1931140 -0.15842917
summary(data_new_pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.6764 1.0941 1.0288 0.9211 0.87295 0.81910 0.69081
## Proportion of Variance 0.3513 0.1496 0.1323 0.1061 0.09526 0.08387 0.05965
## Cumulative Proportion 0.3513 0.5009 0.6332 0.7393 0.83455 0.91842 0.97807
## PC8
## Standard deviation 0.41885
## Proportion of Variance 0.02193
## Cumulative Proportion 1.00000
From the above results, it can be seen that PC1 explains 35.13% of all the variables, and the percentage of explained components of PC2 and PC3 is getting lower and lower, only 14.96% and 13.23%. As of PC5,the accumulated percentage is 83.46% and this result is better.
The Kaiser stopping rule is a method for deciding which components should be selected. In this method, components with eigenvalues greater than 1 should be retained. It is also related to the scree test method, where the eigenvalues are labeled on the vertical axis and the components are labeled on the horizontal axis. The components are arranged in order from largest to smallest and the number of components is chosen according to the elbow rule. If the eigenvalue line tends to be stable, we should choose that number of components. Another way to look at this is to look at the percentage of variance explained; if the components explain 70-90% of the variance, that’s good.
fviz_eig(data_new_pca1, choice='eigenvalue')
According to Kaiser stopping rule, we should choose the first 3 components because only the first three eigenvalues are greater than 1.
fviz_eig(data_new_pca1)
summary(data_new_pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.6764 1.0941 1.0288 0.9211 0.87295 0.81910 0.69081
## Proportion of Variance 0.3513 0.1496 0.1323 0.1061 0.09526 0.08387 0.05965
## Cumulative Proportion 0.3513 0.5009 0.6332 0.7393 0.83455 0.91842 0.97807
## PC8
## Standard deviation 0.41885
## Proportion of Variance 0.02193
## Cumulative Proportion 1.00000
According to the above results, we can find that PC3 only explains 63.32% of the components, and the cumulative explained components are low. Generally the cumulative explained components are better between 70% and 90%, so we should take PC4 and PC5 components into account as well, and the cumulative explained components reach 83.46%.
fviz_pca_var(data_new_pca1, col.var = "steelblue")
data_new_pca1$rotation[,1:6]
## PC1 PC2 PC3 PC4 PC5 PC6
## sbp -0.33366847 0.2385342 -0.07063250 -0.2997153 -0.7752970 0.27512636
## tobacco -0.30958566 0.4585801 0.09623456 0.5848185 0.1123836 -0.25372962
## ldl -0.33717322 -0.3639169 -0.01441214 0.3039581 0.2253898 0.76966892
## adiposity -0.52775285 -0.1874058 -0.03440724 -0.1579115 0.1242232 -0.20231058
## typea 0.02424838 -0.2826109 0.83098332 0.2655895 -0.3537248 -0.08122072
## obesity -0.41280362 -0.3917102 0.12053210 -0.3721910 0.1598844 -0.35927308
## alcohol -0.12033587 0.5428173 0.50417985 -0.4289256 0.4021169 0.27887660
## age -0.46382937 0.1931140 -0.15842917 0.2403964 -0.0875795 -0.11472064
The main components of the ingredients can be roughly seen from the figure, but we can also plot the contribution of the variables to the components to clearly determine the main components of each ingredient.
data_new_pca1$rotation[,1:6]
## PC1 PC2 PC3 PC4 PC5 PC6
## sbp -0.33366847 0.2385342 -0.07063250 -0.2997153 -0.7752970 0.27512636
## tobacco -0.30958566 0.4585801 0.09623456 0.5848185 0.1123836 -0.25372962
## ldl -0.33717322 -0.3639169 -0.01441214 0.3039581 0.2253898 0.76966892
## adiposity -0.52775285 -0.1874058 -0.03440724 -0.1579115 0.1242232 -0.20231058
## typea 0.02424838 -0.2826109 0.83098332 0.2655895 -0.3537248 -0.08122072
## obesity -0.41280362 -0.3917102 0.12053210 -0.3721910 0.1598844 -0.35927308
## alcohol -0.12033587 0.5428173 0.50417985 -0.4289256 0.4021169 0.27887660
## age -0.46382937 0.1931140 -0.15842917 0.2403964 -0.0875795 -0.11472064
var <- get_pca_var(data_new_pca1)
a<-fviz_contrib(data_new_pca1, "var",axes = 1)
b<-fviz_contrib(data_new_pca1, "var",axes = 2)
c<-fviz_contrib(data_new_pca1, "var",axes = 3)
d<-fviz_contrib(data_new_pca1, "var",axes = 4)
e<-fviz_contrib(data_new_pca1, "var",axes = 5)
f<-fviz_contrib(data_new_pca1, "var",axes = 6)
grid.arrange(a,b,c,d,e,f,top='Contribution to the Principal Components')
As can be seen from the above figure, the first component mainly contains: adiposity, age, obesity,ldl; the second component mainly contains: alcohol, tabacco, obesity, ldl; the third component mainly contains: typea, alcohol; the forth component mainly contains: tobacco, alcohol, obesity; the fifth component mainly contains: sbp, alcohol; the sixth component mainly contains: ldl, obesity.
The results of PCA can also be shown on biplot with distinction of disease.(1=patient,0=healthy)
ggbiplot(data_new_pca1, obs.scale=1, var.scale=1, groups=as.factor(data$chd), ellipse=TRUE, circle = TRUE)
As can be seen from the above graph, adiposity, obesity, ldl, and age have a greater impact on coronary heart disease, and these factors are also major components of the first component.
In conclusion, according to the above analysis, it can be seen that the main factors affecting coronary heart disease are adiposity, obesity, ldl, and age, in ordinary life, coronary heart disease patients should pay attention to controlling the fat content, the degree of obesity of the body, and to exercise, to create a healthy body.