Dimension Reduction of Diabetes Health Indicators Dataset

Introduction

This article aims to use Principal Component Analysis (PCA) and Multidimensional scaling (MDS) for dimension reduction of the Diabetes Health Indicators Dataset cleaned from The Behavioral Risk Factor Surveillance System (BRFSS) 2015 dataset. The cleaned dataset can be found on Kaggle website (Alex, 2021).

Principal Component Analysis (PCA) is an unsupervised, non-parametric statistical technique primarily used for dimension reduction in machine learning. It considers the total variance in the data and transforms the original variables into a smaller set of linear combinations.

Multidimensional scaling (MDS) is a technique for visualizing distances between objects, where the distance is known between pairs of the objects

Dimension Reduction is a method to decrease the number of features to describe an observation while maintaining the maximum information content.

Dataset

Original Dataset

diabetes_data <- read.csv("diabetes_BRFSS2015.csv")
sprintf("The original dataset has %s observations with a total of %s features.", nrow(diabetes_data), ncol(diabetes_data))

## [1] "The original dataset has 70692 observations with a total of 21 features."

Description of Variables

HighBP: 0 = no high BP 1 = high BP

HighChol: 0 = no high cholesterol 1 = high cholesterol

CholCheck: 0 = no cholesterol check in 5 years 1 = yes cholesterol check in 5 years

BMI: Body Mass Index

MentHlth: days of poor mental health scale 1-30 days

DiffWalk: difficulty walking or climbing stairs? 0 = no 1 = yes

Age: 13-level age category 1=18-24, 5=40-44, 6=45-49, 8=55-59, 13=80 or older

Income: scale 1-8; 1 = less than $10,000, 8 = $75,000 or more

Education: scale 1-6 1 = Never attended school/only kindergarten 2 = elementary

Other variables description can be seen at https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf

We will work with a reduced dataset with 7,069 observations(10% of original observations) randomly selected from the original dataset.

diabetes_df <- diabetes_data[sample(1:nrow(diabetes_data),7069,replace=F), ]
sprintf("The new dataset has %s observations with a total of %s features.", nrow(diabetes_df), ncol(diabetes_df))

## [1] "The new dataset has 7069 observations with a total of 21 features."

Packages Used for Analysis

library(Hmisc)
library(corrplot)
library(psych)
library(psy)
library(gridExtra)
library(ggplot2)
library(caret)
library(factoextra)
library(maptools)
library(dplyr)

Data Preprocessing and Analysis

Dataset Structure

str(diabetes_df)

## 'data.frame':    7069 obs. of  21 variables:
##  $ HighBP              : int  0 1 0 0 0 1 1 0 1 0 ...
##  $ HighChol            : int  0 1 0 0 0 1 0 0 1 1 ...
##  $ CholCheck           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ BMI                 : int  22 35 22 24 27 28 43 23 21 29 ...
##  $ Smoker              : int  0 0 0 1 0 0 1 0 1 1 ...
##  $ Stroke              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ HeartDiseaseorAttack: int  0 0 0 0 0 0 0 0 1 0 ...
##  $ PhysActivity        : int  1 0 0 1 1 0 1 1 1 1 ...
##  $ Fruits              : int  1 1 1 1 0 0 1 1 1 1 ...
##  $ Veggies             : int  1 1 1 1 1 0 1 1 1 1 ...
##  $ HvyAlcoholConsump   : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ AnyHealthcare       : int  1 1 1 1 1 0 1 1 1 1 ...
##  $ NoDocbcCost         : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ GenHlth             : int  3 4 1 1 2 2 4 2 4 2 ...
##  $ MentHlth            : int  0 30 0 0 1 0 0 0 5 0 ...
##  $ PhysHlth            : int  2 7 0 0 0 5 0 0 2 0 ...
##  $ DiffWalk            : int  0 0 0 0 0 0 1 0 1 0 ...
##  $ Sex                 : int  1 0 0 0 0 1 1 0 0 0 ...
##  $ Age                 : int  11 8 9 11 3 6 11 10 13 4 ...
##  $ Education           : int  6 5 6 6 6 4 5 6 5 5 ...
##  $ Income              : int  7 2 8 7 8 8 5 6 7 7 ...

Missing Values

sprintf("The dataset contain %s missing values .",sum(is.na(diabetes_df)))

## [1] "The dataset contain 0 missing values ."

Dataset Normalisation

preprocess <- preProcess(diabetes_df, method=c("center", "scale"))
diabetes_df_norm <- predict(preprocess, diabetes_df)

Correlation Matrix

cor_matrix <- cor(diabetes_df_norm)
corrplot(cor_matrix, type = "upper", order = "hclust", tl.col = "black", tl.cex = 0.6)

Principal Component Analysis (PCA)

diabetes_pca <- prcomp(diabetes_df_norm, center=FALSE, scale=FALSE)
summary(diabetes_pca)

## Importance of components:
##                           PC1     PC2     PC3     PC4    PC5     PC6     PC7
## Standard deviation     1.9028 1.30542 1.15586 1.10107 1.0854 1.03733 1.02311
## Proportion of Variance 0.1724 0.08115 0.06362 0.05773 0.0561 0.05124 0.04985
## Cumulative Proportion  0.1724 0.25357 0.31719 0.37492 0.4310 0.48226 0.53210
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.97288 0.96358 0.95285 0.90023 0.89607 0.86739 0.85022
## Proportion of Variance 0.04507 0.04421 0.04323 0.03859 0.03824 0.03583 0.03442
## Cumulative Proportion  0.57717 0.62139 0.66462 0.70321 0.74145 0.77728 0.81170
##                          PC15   PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.8374 0.8337 0.80356 0.74483 0.70960 0.68001 0.62573
## Proportion of Variance 0.0334 0.0331 0.03075 0.02642 0.02398 0.02202 0.01864
## Cumulative Proportion  0.8451 0.8782 0.90894 0.93536 0.95934 0.98136 1.00000

Let’s consider three simple approaches, which may be of guidance for deciding the number of relevant principal components.(Jolliffe, 2002)

the visual examination of a scree plot
the variance explained criteria
the Kaiser rule.

Visual examination of a scree plot

Scree Plot

fviz_eig(diabetes_pca,  choice='eigenvalue',barfill = "goldenrod4",barcolor = "#77AA55")

Cumulative Var

plot(summary(diabetes_pca)$importance[3,], ylab="Cumulative Variance", main="Cumulative Variance")

The scree plot indicates that we should choose three (3) principal components to represent our dataset, which explains only 32% of the variance in the dataset.

Variance explained criteria

Another approach to decide on the number of principal components is to set a threshold, between 70% and 90%, and stop when the first k components account for a percentage of total variation greater than this threshold (Jolliffe 2002). We will work with the lower bound of 70% as choosing a higher value will be appropriate when one or two PCs represent very dominant and rather obvious sources of variation.

diabetes_eigVal <- get_eigenvalue(diabetes_pca)
diabetes_eigVal[, -1]

##        variance.percent cumulative.variance.percent
## Dim.1         17.242012                    17.24201
## Dim.2          8.114846                    25.35686
## Dim.3          6.361974                    31.71883
## Dim.4          5.773092                    37.49192
## Dim.5          5.609734                    43.10166
## Dim.6          5.124052                    48.22571
## Dim.7          4.984543                    53.21025
## Dim.8          4.507156                    57.71741
## Dim.9          4.421320                    62.13873
## Dim.10         4.323462                    66.46219
## Dim.11         3.859110                    70.32130
## Dim.12         3.823562                    74.14486
## Dim.13         3.582719                    77.72758
## Dim.14         3.442238                    81.16982
## Dim.15         3.339630                    84.50945
## Dim.16         3.309812                    87.81926
## Dim.17         3.074794                    90.89406
## Dim.18         2.641750                    93.53581
## Dim.19         2.397762                    95.93357
## Dim.20         2.201971                    98.13554
## Dim.21         1.864460                   100.00000

Here, the first 11 components account for 71% of variation.Thus, indicating we choose 11 PCs.

Kaiser’s rule

Kaiser’s rule retains only those PCs whose Eigenvalue exceeds 1. The rule is based on the idea that any PC with Eigenvalue less than 1 contains less information than one of the original variables and is not worth retaining.

diabetes_eigVal[, 1]

##  [1] 3.6208226 1.7041177 1.3360146 1.2123493 1.1780441 1.0760509 1.0467541
##  [8] 0.9465028 0.9284771 0.9079270 0.8104131 0.8029481 0.7523710 0.7228700
## [15] 0.7013223 0.6950606 0.6457068 0.5547674 0.5035301 0.4624138 0.3915365

scree.plot(diabetes_eigVal[,1], title = "EigenValue", type = "E",  simu = "P")

Following the Kaiser’s rule, we can retain seven (7) PCs.

Component Loadings show how the original variables contribute to creating the principal component.

The higher the component loadings, the more important that variable is to the component.

PC1_PC2

fviz_pca_var(diabetes_pca, col.var="contrib",axes = c(1, 2))+
  labs(title="PC1 and PC2") +
  scale_color_gradient2(low="#77AA55", mid="#BB0066", 
                        high="black", midpoint=5)

PC2_PC3

fviz_pca_var(diabetes_pca, col.var="contrib",axes = c(2, 3))+
  labs(title="PC2 and PC3") +
  scale_color_gradient2(low="#77AA55", mid="#BB0066", 
                        high="black", midpoint=5)

The biplot is a very popular way for visualizing results from PCA, as it combines both, the principal component scores and the loading vectors in a single chart.

Biplot

fviz_pca_biplot(diabetes_pca, col.ind = "goldenrod4", col.var = "black")

Individual_PCA

fviz_pca_ind(diabetes_pca, col.ind="cos2", geom = "point", gradient.cols = c("green", "yellow", "red"))

Let’s explore the contribution of each variable to the Principal Components

PC1:PC4

PC_1_Contrib <- fviz_contrib(diabetes_pca, "var", axes=1, fill = "goldenrod4",color = "#77AA55") 
PC_2_Contrib <- fviz_contrib(diabetes_pca, "var", axes=2, fill = "goldenrod4",color = "#77AA55") 
PC_3_Contrib <- fviz_contrib(diabetes_pca, "var", axes=3, fill = "goldenrod4",color = "#77AA55")
PC_4_Contrib <- fviz_contrib(diabetes_pca, "var", axes=4, fill = "goldenrod4",color = "#77AA55")
grid.arrange(PC_1_Contrib,PC_2_Contrib,PC_3_Contrib,PC_4_Contrib)

PC5:PC11

PC_56_Contrib <- fviz_contrib(diabetes_pca, "var", axes=5:6, fill = "goldenrod4",color = "#77AA55") 
PC_78_Contrib <- fviz_contrib(diabetes_pca, "var", axes=7:8, fill = "goldenrod4",color = "#77AA55") 
PC_910_Contrib <- fviz_contrib(diabetes_pca, "var", axes=9:10, fill = "goldenrod4",color = "#77AA55")
PC_11_Contrib <- fviz_contrib(diabetes_pca, "var", axes=11, fill = "goldenrod4",color = "#77AA55")
grid.arrange(PC_56_Contrib,PC_78_Contrib,PC_910_Contrib,PC_11_Contrib)

GenHlth is the most important in PC1, and Age,Fruits,BMI,Sex,AnyHealthCare,HighAlcoholConsump,Stroke, CholCheck,Smoker in PC2 to PC 11 respectively. Other important variables observed are HighChol,Education.

Rotated PCA

diabetes_pca_r<-principal(diabetes_df_norm, nfactors=11, rotate="varimax")

Complexity and Uniqueness

A perfect simple structure solution has a complexity of 1 in that each item would only load on one factor. Uniquenesses represents the proportion of variance that is not shared with other variables. In PCA we want it be low, because then it is easier to reduce the space to a smaller number of dimensions. A uniqueness of 0.20 suggests that 20% or that variable’s variance is not shared with other variables in the overall factor model.

plot(diabetes_pca_r$complexity, diabetes_pca_r$uniqueness)
pointLabel(diabetes_pca_r$complexity, diabetes_pca_r$uniqueness, labels=names(diabetes_pca_r$uniqueness), cex=0.8) 
abline(h=c(0.4), lty=3, col=2)
abline(v=c(1.9), lty=3, col=2)

The variables below might be problematic for analysis and should therefore be excluded.

diabetes_va<-data.frame(complex=diabetes_pca_r$complexity, unique=diabetes_pca_r$uniqueness)
diabetes_va.worst<-diabetes_va[diabetes_va$complex>1.9 & diabetes_va$unique>0.4,]
diabetes_va.worst

##                       complex    unique
## HeartDiseaseorAttack 2.167463 0.4734780
## DiffWalk             3.177633 0.4321536

MultiDimensional Scaling (MDS)

#Distance between Units
diabetes_dist <- dist(t(diabetes_df))
as.matrix(diabetes_dist)[1:6, 1:6]

##               HighBP   HighChol  CholCheck      BMI     Smoker     Stroke
## HighBP       0.00000   49.10193   55.50676 2525.932   57.00877   60.53098
## HighChol    49.10193    0.00000   57.13143 2529.282   56.89464   59.95832
## CholCheck   55.50676   57.13143    0.00000 2493.432   61.00000   80.51708
## BMI       2525.93230 2529.28152 2493.43197    0.000 2535.02229 2568.28231
## Smoker      57.00877   56.89464   61.00000 2535.022    0.00000   57.11392
## Stroke      60.53098   59.95832   80.51708 2568.282   57.11392    0.00000

diabetes_mds <- cmdscale(diabetes_dist, k=2)
summary(diabetes_mds)

##        V1                V2         
##  Min.   :-272.95   Min.   :-135.13  
##  1st Qu.:-239.73   1st Qu.: -54.50  
##  Median :-221.04   Median : -48.26  
##  Mean   :   0.00   Mean   :   0.00  
##  3rd Qu.:  58.75   3rd Qu.: -43.17  
##  Max.   :2294.53   Max.   : 736.38

Visualization

Plot 1

plot(diabetes_mds)
abline(v=c(-500, 500), lty=3, col=2) 
abline(h=c(-200, 200), lty=3, col=2)

Plot 2

x.outlier<-which(diabetes_mds[,1]>500 | diabetes_mds[,1]<(-500))
y.outlier<-which(diabetes_mds[,2]>200 | diabetes_mds[,2]<(-200))
outlier.all<-c(x.outlier, y.outlier)
outlier_uni<-unique(outlier.all)
plot(diabetes_mds) 
abline(v=c(-500, 500), lty=3, col=2)
abline(h=c(-200, 200), lty=3, col=2) 
points(diabetes_mds[outlier_uni,], pch=21, bg="red", cex=2)
text(diabetes_mds[outlier_uni,]-5, labels=rownames(diabetes_mds[outlier_uni,]))

Conclusion

This article aimed to reduce the dimension of the Diabetes Health Indicators Dataset which has a total of 21 features, using Principal Component Analysis (PCA) and Multidimensional Scaling.

The result of the analysis tells that the dataset can be represented by 11 PCs which accounts for 72% variability in the dataset. Also, rotated PCAs pointed out some variables(HeartDiseaseorAttack and DiffWalk) to be excluded based on uniqueness and complexity. MDS was also explored showing similarities and dissimilarities between variables based on the Euclidean distance metric.

References

Alex Teboul(2021) Diabetes Health Indicators Dataset. Retrieved from Kaggle website https://www.kaggle.com/alexteboul/diabetes-health-indicators-dataset?select=diabetes_binary_5050split_health_indicators_BRFSS2015.csv

https://rpubs.com/williamsurles/310847

I.T. Jolliffe (2002) Principal Component Analysis. Retrieved from http://cda.psych.uiuc.edu/statistical_learning_course/Jolliffe%20I.%20Principal%20Component%20Analysis%20(2ed.,%20Springer,%202002)(518s)_MVsa_.pdf

https://www3.cs.stonybrook.edu/~mueller/teaching/cse564/Lec%2017%20-%20Principal%20Component%20Analysis.pdf

Hartmann, K., Krois, J., Waske, B. (2018): E-Learning Project SOGA: Statistics and Geospatial Data Analysis. Department of Earth Sciences, Freie Universitaet Berlin. Retrieved from https://www.geo.fu-berlin.de/en/v/soga/Geodata-analysis/Principal-Component-Analysis/principal-components-basics/Reconstruction-of-the-original-data/index.html

https://easystats.github.io/parameters/reference/principal_components.html

https://www.displayr.com/what-is-multidimensional-scaling-mds/

https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/principal-component-analysis-pca/