Dimension reduction of the dataset containing health statistics of men prone to heart attacks

Introduction

Dimension reduction is an important process in data science. Models fitted to datasets with many dimensions are prone to overfitting, and a large number of variables can add noise or make the outcome less intuitive to interpret. To address this, we replace the original variables with a smaller set of derived variables, such as the principal components. Applying dimension reduction to a dataset produces data with the same number of observations but fewer variables, while preserving most of the information about the observations.

The dataset used for the analysis contains statistics from Western Cape, South Africa. It includes eight variables (sbp: systolic blood pressure; tobacco: cumulative tobacco; ldl: low density lipoprotein cholesterol; adiposity: a numeric vector; typea: type-A behavior; obesity: a numeric vector; alcohol: current alcohol consumption; age: age at onset), which describe the health condition of 462 males.

Data preparation

At the beginning of the analysis, a summary of the data was produced.

summary(dane)
##       sbp           tobacco             ldl           adiposity    
##  Min.   :101.0   Min.   : 0.0000   Min.   : 0.980   Min.   : 6.74  
##  1st Qu.:124.0   1st Qu.: 0.0525   1st Qu.: 3.283   1st Qu.:19.77  
##  Median :134.0   Median : 2.0000   Median : 4.340   Median :26.11  
##  Mean   :138.3   Mean   : 3.6356   Mean   : 4.740   Mean   :25.41  
##  3rd Qu.:148.0   3rd Qu.: 5.5000   3rd Qu.: 5.790   3rd Qu.:31.23  
##  Max.   :218.0   Max.   :31.2000   Max.   :15.330   Max.   :42.49  
##      typea         obesity         alcohol            age       
##  Min.   :13.0   Min.   :14.70   Min.   :  0.00   Min.   :15.00  
##  1st Qu.:47.0   1st Qu.:22.98   1st Qu.:  0.51   1st Qu.:31.00  
##  Median :53.0   Median :25.80   Median :  7.51   Median :45.00  
##  Mean   :53.1   Mean   :26.04   Mean   : 17.04   Mean   :42.82  
##  3rd Qu.:60.0   3rd Qu.:28.50   3rd Qu.: 23.89   3rd Qu.:55.00  
##  Max.   :78.0   Max.   :46.58   Max.   :147.19   Max.   :64.00

The summary shows that the variables differ in scale and distribution. For this reason, it was decided to standardize the data.

scale.data<-scale(dane)
summary(scale.data)
##       sbp             tobacco             ldl            adiposity       
##  Min.   :-1.8211   Min.   :-0.7916   Min.   :-1.8158   Min.   :-2.39911  
##  1st Qu.:-0.6990   1st Qu.:-0.7801   1st Qu.:-0.7040   1st Qu.:-0.72381  
##  Median :-0.2111   Median :-0.3561   Median :-0.1933   Median : 0.09103  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.4719   3rd Qu.: 0.4059   3rd Qu.: 0.5069   3rd Qu.: 0.74810  
##  Max.   : 3.8872   Max.   : 6.0014   Max.   : 5.1135   Max.   : 2.19560  
##      typea             obesity            alcohol             age         
##  Min.   :-4.08493   Min.   :-2.69221   Min.   :-0.6962   Min.   :-1.9040  
##  1st Qu.:-0.62173   1st Qu.:-0.72599   1st Qu.:-0.6754   1st Qu.:-0.8088  
##  Median :-0.01058   Median :-0.05675   Median :-0.3895   Median : 0.1495  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.70243   3rd Qu.: 0.58224   3rd Qu.: 0.2797   3rd Qu.: 0.8340  
##  Max.   : 2.53588   Max.   : 4.87362   Max.   : 5.3162   Max.   : 1.4501
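
To make the standardization step concrete, the short sketch below reproduces what scale() does with its default arguments on a small made-up matrix (the values are hypothetical and unrelated to the heart dataset): each column has its mean subtracted and is then divided by its standard deviation.

```r
# Toy data with two columns on very different scales (hypothetical values).
x <- cbind(a = c(1, 2, 3, 4, 5),
           b = c(100, 300, 200, 500, 400))

# Manual z-score: subtract the column mean, then divide by the column
# standard deviation.
z_manual <- sweep(sweep(x, 2, colMeans(x), "-"), 2, apply(x, 2, sd), "/")

# scale() with its defaults (center = TRUE, scale = TRUE) does the same.
z_scale <- scale(x)

# The two results agree; every column now has mean 0 and standard deviation 1.
all.equal(z_manual, z_scale, check.attributes = FALSE)
round(colMeans(z_scale), 10)
apply(z_scale, 2, sd)
```

After this transformation no variable dominates a distance or variance calculation merely because it is measured in larger units.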

Applying multidimensional scaling

A commonly used technique for dimension reduction is multidimensional scaling (MDS). The approach assumes that a small number of dimensions can capture most of the information in the data. It is a multivariate analysis method used to visualize the similarity between samples by plotting them as points, usually in two dimensions. There are two types of MDS algorithms. Classical (metric) multidimensional scaling preserves the original distances between points as well as possible. Non-metric multidimensional scaling does not use the distances themselves but only their rank order, that is, how each distance compares with the distances between other pairs of objects. Now let’s apply these methods to our dataset.
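
The difference between the two variants can be sketched on R's built-in mtcars data (used here only as a stand-in, not the heart dataset). The sketch assumes cmdscale() from base R for classical MDS and isoMDS() from the MASS package for non-metric MDS; other implementations exist, and this is only one possible choice.

```r
library(MASS)  # isoMDS(); MASS is distributed with standard R installations

# Built-in example data, standardized so that no variable dominates.
x <- scale(mtcars)
d <- dist(x)  # Euclidean distances between observations

# Classical (metric) MDS: two coordinates per observation.
classical <- cmdscale(d, k = 2)

# Non-metric MDS: only the rank order of the distances is fitted.
nonmetric <- isoMDS(d, k = 2, trace = FALSE)

# How well do the 2-D distances reproduce the original ones?
cor(as.vector(d), as.vector(dist(classical)))           # linear agreement
cor(as.vector(d), as.vector(dist(nonmetric$points)),
    method = "spearman")                                # rank agreement
nonmetric$stress                                        # Kruskal stress (percent)
```

Note that isoMDS() stops with an error when two distinct observations have a distance of zero, so duplicated rows would need to be handled before calling it.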

Classical MDS

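library(ggpubr)  # provides ggscatter()
# Assumed step (not shown in the original): classical MDS on Euclidean distances
ds <- as.data.frame(cmdscale(dist(scale.data), k = 2))
colnames(ds) <- c("Dim.1", "Dim.2")
ggscatter(ds, x = "Dim.1", y = "Dim.2",
          label = rownames(scale.data),
          size = 1,
          repel = TRUE)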

Non-metric MDS

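# Assumed step (not shown in the original): non-metric MDS via MASS::isoMDS()
ds <- as.data.frame(MASS::isoMDS(dist(scale.data), k = 2)$points)
colnames(ds) <- c("Dim.1", "Dim.2")
ggscatter(ds, x = "Dim.1", y = "Dim.2",
          label = rownames(scale.data),
          size = 1)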

As we can see in the two charts presented, the results are quite similar. Larger differences would be noticeable after clustering a set of data prepared in this way, but this is not the aim of the presented work.

Principal component analysis

Principal component analysis (PCA) is a technique that applies a linear transformation to the data and looks for the directions in space that have the highest variance. It is a useful tool for exploring data, allowing the researcher to better visualize the variation present in a dataset with many variables. PCA yields new variables, the principal components, which are linear combinations of the original variables. The components are uncorrelated with one another, and their directions are mutually orthogonal.
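
These properties can be checked numerically. The sketch below uses R's built-in USArrests data as a stand-in (an assumption made for illustration, not the heart dataset) and verifies that the component scores are uncorrelated, that the loading vectors are orthonormal, and that the component variances equal the eigenvalues of the correlation matrix.

```r
x   <- scale(USArrests)   # built-in data, standardized
pca <- prcomp(x)          # x is already centered and scaled

# Scores (the new variables) are mutually uncorrelated:
round(cor(pca$x), 10)                 # identity matrix up to rounding

# Loadings (the component directions) are orthonormal:
round(crossprod(pca$rotation), 10)    # identity matrix up to rounding

# Component variances are the eigenvalues of the correlation matrix.
all.equal(pca$sdev^2, eigen(cor(USArrests))$values)
```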

pca1<-prcomp(scale.data, center=TRUE, scale.=FALSE)
summary(pca1)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.6764 1.0941 1.0288 0.9211 0.87295 0.81910 0.69081
## Proportion of Variance 0.3513 0.1496 0.1323 0.1061 0.09526 0.08387 0.05965
## Cumulative Proportion  0.3513 0.5009 0.6332 0.7393 0.83455 0.91842 0.97807
##                            PC8
## Standard deviation     0.41885
## Proportion of Variance 0.02193
## Cumulative Proportion  1.00000

The cumulative proportion shows that the first two principal components explain 50.1% of the variance, while the first three explain 63.3%. As expected, the proportion of variance explained decreases with each successive component.
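
The proportions in the summary follow directly from the reported standard deviations: each component's variance is its squared standard deviation, divided by the total variance. The sketch below recomputes them from the rounded values printed above, so small rounding differences are possible. The 80% threshold is an arbitrary illustrative choice, not a rule used in this analysis.

```r
# Standard deviations of the eight components, copied from the summary above.
sdev <- c(1.6764, 1.0941, 1.0288, 0.9211, 0.87295, 0.81910, 0.69081, 0.41885)

variance   <- sdev^2                     # variance of each component
proportion <- variance / sum(variance)   # share of the total variance
cumulative <- cumsum(proportion)         # running total

round(proportion, 4)
round(cumulative, 4)

# Smallest number of components whose cumulative share reaches the threshold.
which(cumulative >= 0.80)[1]
```

Because the data were standardized, the total variance is approximately 8, the number of variables.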

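library(ggplot2)     # theme(), element_text()
library(factoextra)  # fviz_eig() and the other fviz_* functions
wykres <- theme(plot.title = element_text(size = 10, face = "bold"),
                axis.text = element_text(size = 14),
                axis.title = element_text(size = 14))
fviz_eig(pca1) + wykres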

The scree plot helps us decide how many dimensions to retain so as to preserve as much explanatory value as possible while keeping only the necessary components. For the purposes of further visualization of the results, let’s assume that the optimal number is two.

library(factoextra)
fviz_contrib(pca1, choice = "var", axes = 2) + wykres
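
The contribution of a variable to a single component can also be computed by hand. The sketch below uses only base R and the built-in USArrests data as a stand-in, and assumes the usual definition for a prcomp() fit: the squared loading expressed as a percentage. To my understanding this matches what fviz_contrib() plots for one axis, but treat that as an assumption rather than a statement about the package internals.

```r
pca <- prcomp(scale(USArrests))

# Loadings are unit-length vectors, so the squared loadings in each column
# sum to 1; multiplying by 100 gives percentage contributions.
contrib <- 100 * pca$rotation^2

round(contrib[, "PC2"], 1)   # contributions to the second component
colSums(contrib)             # every column sums to 100

# Reference level for an "average" contribution: 100 / number of variables.
100 / nrow(contrib)
```

A variable whose contribution exceeds this reference level contributes more than it would if all variables contributed equally.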

As we can see, the variables alcohol, tobacco, obesity and ldl contribute most to the second principal component (the fviz_contrib call uses axes = 2). In the next plot, the observations are placed on a two-dimensional graph. Observations that are close to each other are similar, whereas observations that are far apart are different. Arrows representing the variables are also marked on the graph. For example, we can see that observations such as 417 and 398 are characterized by high alcohol and tobacco consumption.

fviz_pca_biplot(pca1,
                col.var = "#2E9FDF", 
                col.ind = "#696969"  
                )

The last graph is the correlation circle, which shows the correlations between the variables.

dane.pca <- prcomp(scale.data, center=TRUE, scale.=FALSE)
fviz_pca_var(dane.pca, col.var="black", labelsize = 6) + wykres

Negatively correlated variables point to opposite sides of the circle, and if there is no correlation, the arrows are orthogonal. As we can see, some variables are positively correlated, such as alcohol, tobacco, systolic blood pressure and age: they appear on the same side of the circle, close to each other. A second group of mutually correlated variables consists of adiposity, obesity and low density lipoprotein cholesterol. Type-A behavior is the variable least correlated with the others.
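
The arrow coordinates in such a circle can be reproduced without any plotting package. For standardized data, each coordinate is the loading multiplied by the standard deviation of the component, which equals the correlation between the variable and that component. The sketch below checks this on the built-in USArrests data, used as a stand-in for illustration.

```r
x   <- scale(USArrests)
pca <- prcomp(x)

# Arrow coordinates: loadings scaled by the component standard deviations.
coords <- sweep(pca$rotation, 2, pca$sdev, "*")

# For standardized data these equal the variable-component correlations.
all.equal(coords, cor(x, pca$x))

# Squared length of each arrow on the first two axes: the share of the
# variable captured by the 2-D plot. Arrows reaching close to the unit
# circle are well represented.
round(rowSums(coords[, 1:2]^2), 3)
```

Angles between arrows are a reliable guide to correlation only for variables that are well represented on the two plotted axes. Short arrows should be interpreted with caution.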

Conclusions

Dimension reduction is a really useful tool in data science. It helps to prepare data before more complex analysis. The use of dimension reduction methods helps to prevent overfitting and to reduce the computational complexity of models.

The techniques used in this paper were multidimensional scaling and principal component analysis. Conceptually, there is a close correspondence between MDS and PCA. However, these methods focus on different things. MDS focuses on the relations among the scaled objects, while PCA is concerned with the dimensions themselves: it seeks to maximize the explained variance. In addition, MDS is typically used to display n-dimensional data in two dimensions, placing similar observations close to each other, while PCA, which uses the covariance or correlation matrix to analyze the relationships between observations and variables, can reduce the data to a number of dimensions other than two.
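
The correspondence between the two methods can be made concrete in one special case: classical MDS applied to Euclidean distances gives the same configuration as the leading principal component scores, up to the sign of each axis. The sketch below illustrates this on the built-in USArrests data (a stand-in for illustration). It does not extend to non-metric MDS or to other distance measures.

```r
x <- scale(USArrests)

mds <- cmdscale(dist(x), k = 2)   # classical MDS on Euclidean distances
pcs <- prcomp(x)$x[, 1:2]         # first two principal component scores

# Axis signs are arbitrary in both methods, so compare absolute values.
max(abs(abs(mds) - abs(pcs)))     # numerically zero
```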

In the case of the data used in this paper, it turned out that the most valuable variables are alcohol, tobacco and obesity. The other variables did not provide as much information, as they were strongly correlated with these three. This was seen in the circle plot at the end of the paper.