Dimension reduction is an important process in data science. Using datasets with many dimensions may result in overfitting the model. Many variables in a model can add noise or make the interpretation of the outcome less intuitive. For this reason we reduce the number of variables to a smaller set of new variables, called principal components in the case of PCA. Applying dimension reduction to a dataset creates data with the same number of observations but fewer variables, whilst maintaining the general information about the observations.
The dataset used for the analysis contains statistics from the Western Cape, South Africa. It includes eight variables (sbp: systolic blood pressure; tobacco: cumulative tobacco; ldl: low density lipoprotein cholesterol; adiposity: a numeric vector; typea: type-A behavior; obesity: a numeric vector; alcohol: current alcohol consumption; age: age at onset), which describe the health condition of 462 males.
At the beginning of the analysis, a summary of the data was produced.
summary(dane)
## sbp tobacco ldl adiposity
## Min. :101.0 Min. : 0.0000 Min. : 0.980 Min. : 6.74
## 1st Qu.:124.0 1st Qu.: 0.0525 1st Qu.: 3.283 1st Qu.:19.77
## Median :134.0 Median : 2.0000 Median : 4.340 Median :26.11
## Mean :138.3 Mean : 3.6356 Mean : 4.740 Mean :25.41
## 3rd Qu.:148.0 3rd Qu.: 5.5000 3rd Qu.: 5.790 3rd Qu.:31.23
## Max. :218.0 Max. :31.2000 Max. :15.330 Max. :42.49
## typea obesity alcohol age
## Min. :13.0 Min. :14.70 Min. : 0.00 Min. :15.00
## 1st Qu.:47.0 1st Qu.:22.98 1st Qu.: 0.51 1st Qu.:31.00
## Median :53.0 Median :25.80 Median : 7.51 Median :45.00
## Mean :53.1 Mean :26.04 Mean : 17.04 Mean :42.82
## 3rd Qu.:60.0 3rd Qu.:28.50 3rd Qu.: 23.89 3rd Qu.:55.00
## Max. :78.0 Max. :46.58 Max. :147.19 Max. :64.00
The variables clearly differ in scale and distribution. For this reason, it was decided to standardize the data.
# Center each variable to mean 0 and scale it to standard deviation 1
scale.data <- scale(dane)
summary(scale.data)
## sbp tobacco ldl adiposity
## Min. :-1.8211 Min. :-0.7916 Min. :-1.8158 Min. :-2.39911
## 1st Qu.:-0.6990 1st Qu.:-0.7801 1st Qu.:-0.7040 1st Qu.:-0.72381
## Median :-0.2111 Median :-0.3561 Median :-0.1933 Median : 0.09103
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.4719 3rd Qu.: 0.4059 3rd Qu.: 0.5069 3rd Qu.: 0.74810
## Max. : 3.8872 Max. : 6.0014 Max. : 5.1135 Max. : 2.19560
## typea obesity alcohol age
## Min. :-4.08493 Min. :-2.69221 Min. :-0.6962 Min. :-1.9040
## 1st Qu.:-0.62173 1st Qu.:-0.72599 1st Qu.:-0.6754 1st Qu.:-0.8088
## Median :-0.01058 Median :-0.05675 Median :-0.3895 Median : 0.1495
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.70243 3rd Qu.: 0.58224 3rd Qu.: 0.2797 3rd Qu.: 0.8340
## Max. : 2.53588 Max. : 4.87362 Max. : 5.3162 Max. : 1.4501
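What `scale()` does with its default arguments can be checked by hand: each column has its mean subtracted and is then divided by its sample standard deviation. The sketch below uses a small made-up data frame; the name `toy` and its values are illustrative assumptions, not part of the dataset analysed here.

```r
# Toy data, invented for illustration only (not the dataset used in this analysis)
toy <- data.frame(sbp = c(120, 135, 150, 160),
                  alcohol = c(0, 5, 20, 95))

# Standardize manually: subtract the column mean, divide by the sample sd
manual <- sapply(toy, function(x) (x - mean(x)) / sd(x))

# scale() with its defaults (center = TRUE, scale = TRUE) should do the same
scaled <- scale(toy)

all.equal(manual, scaled, check.attributes = FALSE)
round(colMeans(scaled), 10)   # column means after standardization
apply(scaled, 2, sd)          # column standard deviations after standardization
```

After standardization every variable has the same weight in distance-based methods such as MDS and in PCA, regardless of its original unit.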
A commonly used technique for dimension reduction is multidimensional scaling (MDS). The approach assumes that a few most informative dimensions capture the largest amount of information. It is a multivariate data analysis approach used to visualize the similarity between samples by plotting points in a two-dimensional plot. There are two types of MDS algorithms. Classical multidimensional scaling preserves the original distances between points as well as possible. The second type is non-metric multidimensional scaling. This method does not use the metric distances themselves but their rank order, that is, the value of each distance in relation to the distances between other pairs of objects. Now let’s apply these methods to our dataset.
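library(ggpubr)  # provides ggscatter()
# Assumption: the step that builds `ds` is not shown in the source;
# classical MDS on Euclidean distances via cmdscale() is assumed here.
ds <- as.data.frame(cmdscale(dist(scale.data), k = 2))
colnames(ds) <- c("Dim.1", "Dim.2")
ggscatter(ds, x = "Dim.1", y = "Dim.2",
          label = rownames(scale.data),
          size = 1,
          repel = TRUE)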
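library(MASS)  # provides isoMDS()
# Assumption: the source omits how `ds` is rebuilt for this chart;
# non-metric MDS via isoMDS() on the same Euclidean distances is assumed.
ds <- as.data.frame(isoMDS(dist(scale.data), k = 2)$points)
colnames(ds) <- c("Dim.1", "Dim.2")
ggscatter(ds, x = "Dim.1", y = "Dim.2",
          label = rownames(scale.data),
          size = 1)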
As we can see in the two charts presented, the results are quite similar. Larger differences would be noticeable after clustering a set of data prepared in this way, but this is not the aim of the presented work.
Principal component analysis (PCA) is a technique that applies a linear transformation to the data and looks for the directions in space that have the highest variance. It is a useful tool for exploring data, allowing the researcher to better visualize the variation present in a dataset with many variables. With PCA we obtain new variables, the principal components, which are uncorrelated and orthogonal to one another.
pca1<-prcomp(scale.data, center=TRUE, scale.=FALSE)
summary(pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.6764 1.0941 1.0288 0.9211 0.87295 0.81910 0.69081
## Proportion of Variance 0.3513 0.1496 0.1323 0.1061 0.09526 0.08387 0.05965
## Cumulative Proportion 0.3513 0.5009 0.6332 0.7393 0.83455 0.91842 0.97807
## PC8
## Standard deviation 0.41885
## Proportion of Variance 0.02193
## Cumulative Proportion 1.00000
The cumulative proportion shows that the first two principal components explain 50.1% of the variance, while the first three explain 63.3%. As expected, the proportion of variance explained decreases with each successive component.
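The proportions reported by `summary()` follow directly from the component standard deviations: squaring them gives the variance of each component, and dividing by the total gives the share explained. A minimal sketch, using the standard deviations copied from the output above (the variable names `sdev`, `eigenvalues`, `prop_var` and `cum_var` are chosen here for illustration):

```r
# Standard deviations of the eight components, copied from the summary output
sdev <- c(1.6764, 1.0941, 1.0288, 0.9211, 0.87295, 0.81910, 0.69081, 0.41885)

eigenvalues <- sdev^2                        # variance of each component
prop_var <- eigenvalues / sum(eigenvalues)   # proportion of variance
cum_var <- cumsum(prop_var)                  # cumulative proportion

sum(eigenvalues)   # approximately 8, the number of standardized variables
round(cbind(prop_var, cum_var), 4)
```

Because the data were standardized, the total variance is approximately equal to the number of variables, so each proportion is roughly the component variance divided by eight.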
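library(ggplot2)     # theme(), element_text()
library(factoextra)  # fviz_eig() and the other fviz_* functions
# Shared text sizes for all plots ("wykres" is Polish for "chart")
wykres <- theme(plot.title = element_text(size = 10, face = "bold"),
                axis.text = element_text(size = 14),
                axis.title = element_text(size = 14))
fviz_eig(pca1) + wykres  # scree plot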
The scree plot shows how many dimensions should be retained to preserve as much explanatory value as possible while keeping only the necessary components. For the purposes of further visualization of the results, let’s assume that the optimal number is two.
library(factoextra)
fviz_contrib(pca1, choice = "var", axes = 2) + wykres
As we can see, the alcohol, tobacco, obesity and ldl variables contributed most to the second principal component (the chart was drawn with axes = 2). In the next plot, the observations are placed on a two-dimensional graph. Observations that are close to each other are similar, whereas observations that are far apart are different. Arrows representing the variables are also marked on the graph. For example, we can see that observations such as 417 and 398 are characterized by high alcohol and tobacco consumption.
fviz_pca_biplot(pca1,
col.var = "#2E9FDF",
col.ind = "#696969"
)
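The variable contributions shown in the earlier bar chart can also be computed without a plotting package. For a `prcomp` object, each column of `rotation` is a unit-length loading vector, so the squared loadings multiplied by 100 give the percentage contribution of each variable to a component; this should match what `fviz_contrib` plots. The sketch below uses the built-in `USArrests` data as a stand-in, since the dataset analysed here is not bundled with R; `pca_demo` and `contrib` are illustrative names.

```r
# Built-in USArrests data used as a stand-in for the article's dataset
pca_demo <- prcomp(scale(USArrests))

# Squared loadings of one component sum to 1, so multiplying by 100
# gives percentage contributions of the variables to that component
contrib <- 100 * pca_demo$rotation^2

round(contrib[, "PC2"], 1)   # contribution of each variable to PC2
colSums(contrib)             # contributions to each component add up to 100

# Level each variable would reach if all contributed equally
100 / nrow(contrib)
```

Variables whose contribution exceeds the equal-share level are the ones that matter most for the chosen component.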
The last graph is the correlation circle, which shows the correlations between the variables.
dane.pca <- prcomp(scale.data, center=TRUE, scale.=FALSE)
fviz_pca_var(dane.pca, col.var="black", labelsize = 6) + wykres
Negatively correlated variables lie on opposite sides of the circle, and if there is no correlation, the arrows are orthogonal. As we can see, some variables are positively correlated, such as alcohol, tobacco, systolic blood pressure and age: they appear on the same side of the circle, close to each other. A second group of mutually correlated variables consists of adiposity, obesity and low density lipoprotein cholesterol. Type-A behavior is the variable least correlated with the others.
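The arrow coordinates in a correlation circle can be reproduced by hand. For standardized data, multiplying each loading by the standard deviation of its component gives the correlation between the variable and that component. A minimal sketch, again on the built-in `USArrests` data as a stand-in (the names `X`, `pca_demo`, `var_coord` and `var_cor` are illustrative):

```r
# Built-in USArrests data used as a stand-in for the article's dataset
X <- scale(USArrests)
pca_demo <- prcomp(X)

# Arrow coordinates: loadings multiplied by the component standard deviations
var_coord <- sweep(pca_demo$rotation, 2, pca_demo$sdev, `*`)

# For standardized data these equal the variable-component correlations
var_cor <- cor(X, pca_demo$x)
all.equal(var_coord, var_cor, check.attributes = FALSE)

round(var_coord[, 1:2], 2)   # coordinates on the first two axes
round(cor(pca_demo$x), 10)   # the components themselves are uncorrelated
```

Since the coordinates are correlations, no arrow can extend beyond the unit circle, and arrows close to the circle are well represented by the first two components.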
Dimension reduction is a really useful tool in data science. It helps to prepare data before more complex analysis. Dimension reduction methods help to prevent overfitting and to reduce the computational complexity of models.
The techniques used in this paper were multidimensional scaling and principal component analysis. Conceptually, there is a close correspondence between MDS and PCA. However, the methods focus on different things. MDS focuses on the relations among the scaled objects, while PCA is about the dimensions themselves: it seeks to maximize the explained variance. In addition, MDS is typically used to display n-dimensional data in two dimensions, placing similar observations close to each other, while PCA, which uses the covariance or correlation matrix to analyze the relationships between variables, can readily reduce the data to a number of dimensions other than two.
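The correspondence between the two methods can be made concrete: when classical MDS is run on Euclidean distances, the resulting coordinates coincide with the PCA scores up to the sign of each axis. A minimal sketch, using the built-in `USArrests` data as a stand-in for the dataset analysed here (the names `X`, `mds` and `pcs` are illustrative):

```r
# Built-in USArrests data used as a stand-in for the article's dataset
X <- scale(USArrests)

mds <- cmdscale(dist(X), k = 2)   # classical MDS on Euclidean distances
pcs <- prcomp(X)$x[, 1:2]         # scores on the first two principal components

# Axes may be flipped, so compare absolute values
max(abs(abs(mds) - abs(pcs)))     # numerically close to zero
```

The equivalence holds only for Euclidean distances; with another distance measure, or with non-metric MDS, the two configurations generally differ.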
For the data used in this paper, it turned out that the most valuable variables are alcohol, tobacco and obesity. The other variables did not provide as much information because they were strongly correlated with these three. This could be seen in the correlation circle at the end of the analysis.