This article uses Principal Component Analysis (PCA) and Multidimensional Scaling (MDS) for dimension reduction of the Diabetes Health Indicators Dataset, a cleaned subset of the Behavioral Risk Factor Surveillance System (BRFSS) 2015 data. The cleaned dataset is available on Kaggle (Teboul, 2021).
Principal Component Analysis (PCA) is an unsupervised, non-parametric statistical technique primarily used for dimension reduction in machine learning. It considers the total variance in the data and transforms the original variables into a smaller set of linear combinations.
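As a minimal illustration (a sketch using R's built-in mtcars data, which is not part of this analysis), the PC scores produced by prcomp() are exactly linear combinations of the standardized variables:
# PCA on a toy dataset: scores = standardized data %*% rotation matrix
pca_demo <- prcomp(mtcars, center = TRUE, scale. = TRUE)
all.equal(pca_demo$x,
          scale(as.matrix(mtcars)) %*% pca_demo$rotation,
          check.attributes = FALSE)  # should print TRUE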
Multidimensional Scaling (MDS) is a technique for visualizing distances between objects, where the distances between pairs of objects are known.
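For intuition (a small sketch using R's built-in eurodist road-distance matrix, not part of this analysis), classical MDS recovers a 2-D "map" of European cities from nothing but their pairwise distances:
# classical MDS on pairwise road distances between European cities
euro_mds <- cmdscale(eurodist, k = 2)
plot(euro_mds, type = "n", main = "MDS of eurodist")
text(euro_mds, labels = rownames(euro_mds), cex = 0.7)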
Dimension reduction is a method for decreasing the number of features used to describe an observation while preserving as much of the information content as possible.
diabetes_data <- read.csv("diabetes_BRFSS2015.csv")
sprintf("The original dataset has %s observations with a total of %s features.", nrow(diabetes_data), ncol(diabetes_data))
## [1] "The original dataset has 70692 observations with a total of 21 features."
HighBP: 0 = no high blood pressure, 1 = high blood pressure
HighChol: 0 = no high cholesterol, 1 = high cholesterol
CholCheck: 0 = no cholesterol check in the past 5 years, 1 = cholesterol check in the past 5 years
BMI: Body Mass Index
MentHlth: days of poor mental health, scale 1-30 days
DiffWalk: difficulty walking or climbing stairs: 0 = no, 1 = yes
Age: 13-level age category, e.g. 1 = 18-24, 5 = 40-44, 6 = 45-49, 8 = 55-59, 13 = 80 or older
Income: scale 1-8; 1 = less than $10,000, 8 = $75,000 or more
Education: scale 1-6; 1 = never attended school or only kindergarten, 2 = elementary
Descriptions of the remaining variables can be found in the BRFSS 2015 codebook: https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf
set.seed(2021)  # arbitrary seed so the 10% subsample is reproducible
diabetes_df <- diabetes_data[sample(1:nrow(diabetes_data), 7069, replace = FALSE), ]
sprintf("The new dataset has %s observations with a total of %s features.", nrow(diabetes_df), ncol(diabetes_df))
## [1] "The new dataset has 7069 observations with a total of 21 features."
library(Hmisc)
library(corrplot)
library(psych)
library(psy)
library(gridExtra)
library(ggplot2)
library(caret)
library(factoextra)
library(maptools)  # for pointLabel(); note maptools has since been archived on CRAN
library(dplyr)
str(diabetes_df)
## 'data.frame': 7069 obs. of 21 variables:
## $ HighBP : int 0 1 0 0 0 1 1 0 1 0 ...
## $ HighChol : int 0 1 0 0 0 1 0 0 1 1 ...
## $ CholCheck : int 1 1 1 1 1 1 1 1 1 1 ...
## $ BMI : int 22 35 22 24 27 28 43 23 21 29 ...
## $ Smoker : int 0 0 0 1 0 0 1 0 1 1 ...
## $ Stroke : int 0 0 0 0 0 0 0 0 0 0 ...
## $ HeartDiseaseorAttack: int 0 0 0 0 0 0 0 0 1 0 ...
## $ PhysActivity : int 1 0 0 1 1 0 1 1 1 1 ...
## $ Fruits : int 1 1 1 1 0 0 1 1 1 1 ...
## $ Veggies : int 1 1 1 1 1 0 1 1 1 1 ...
## $ HvyAlcoholConsump : int 0 0 0 0 0 0 0 1 0 0 ...
## $ AnyHealthcare : int 1 1 1 1 1 0 1 1 1 1 ...
## $ NoDocbcCost : int 0 1 0 0 0 0 0 0 0 0 ...
## $ GenHlth : int 3 4 1 1 2 2 4 2 4 2 ...
## $ MentHlth : int 0 30 0 0 1 0 0 0 5 0 ...
## $ PhysHlth : int 2 7 0 0 0 5 0 0 2 0 ...
## $ DiffWalk : int 0 0 0 0 0 0 1 0 1 0 ...
## $ Sex : int 1 0 0 0 0 1 1 0 0 0 ...
## $ Age : int 11 8 9 11 3 6 11 10 13 4 ...
## $ Education : int 6 5 6 6 6 4 5 6 5 5 ...
## $ Income : int 7 2 8 7 8 8 5 6 7 7 ...
sprintf("The dataset contain %s missing values .",sum(is.na(diabetes_df)))
## [1] "The dataset contain 0 missing values ."
preprocess <- preProcess(diabetes_df, method=c("center", "scale"))
diabetes_df_norm <- predict(preprocess, diabetes_df)
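As a quick sanity check (a sketch, not in the original write-up), caret's center/scale preprocessing should agree with base R's scale():
# caret's centering and scaling should match base R's scale()
all.equal(as.matrix(diabetes_df_norm), scale(as.matrix(diabetes_df)),
          check.attributes = FALSE)  # should print TRUE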
cor_matrix <- cor(diabetes_df_norm)
corrplot(cor_matrix, type = "upper", order = "hclust", tl.col = "black", tl.cex = 0.6)
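The correlation plot can be complemented numerically; for example (a sketch, with an arbitrary 0.75 cutoff), caret's findCorrelation() lists the variables it would drop for being too highly correlated with others:
# variables caret would flag for pairwise correlation above the cutoff
findCorrelation(cor_matrix, cutoff = 0.75, names = TRUE)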
# the data are already centered and scaled above, so both options are disabled
diabetes_pca <- prcomp(diabetes_df_norm, center = FALSE, scale. = FALSE)
summary(diabetes_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9028 1.30542 1.15586 1.10107 1.0854 1.03733 1.02311
## Proportion of Variance 0.1724 0.08115 0.06362 0.05773 0.0561 0.05124 0.04985
## Cumulative Proportion 0.1724 0.25357 0.31719 0.37492 0.4310 0.48226 0.53210
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.97288 0.96358 0.95285 0.90023 0.89607 0.86739 0.85022
## Proportion of Variance 0.04507 0.04421 0.04323 0.03859 0.03824 0.03583 0.03442
## Cumulative Proportion 0.57717 0.62139 0.66462 0.70321 0.74145 0.77728 0.81170
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.8374 0.8337 0.80356 0.74483 0.70960 0.68001 0.62573
## Proportion of Variance 0.0334 0.0331 0.03075 0.02642 0.02398 0.02202 0.01864
## Cumulative Proportion 0.8451 0.8782 0.90894 0.93536 0.95934 0.98136 1.00000
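The proportions in the summary can be reproduced directly from the component standard deviations, since each PC's variance is its standard deviation squared:
# proportion of variance explained, recomputed from sdev
prop_var <- diabetes_pca$sdev^2 / sum(diabetes_pca$sdev^2)
round(head(prop_var, 3), 5)
## [1] 0.17242 0.08115 0.06362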
Three common criteria can guide the choice of how many principal components to retain:
the visual examination of a scree plot,
the variance-explained criterion,
the Kaiser rule.
fviz_eig(diabetes_pca, choice='eigenvalue',barfill = "goldenrod4",barcolor = "#77AA55")
plot(summary(diabetes_pca)$importance[3, ], type = "b", xlab = "Principal Component", ylab = "Cumulative Variance", main = "Cumulative Variance")
The scree plot suggests retaining three (3) principal components, but these explain only about 32% of the variance in the dataset.
Another approach to deciding on the number of principal components is to set a threshold, typically between 70% and 90%, and retain the first k components that together account for a percentage of total variation exceeding this threshold (Jolliffe, 2002). We work with the lower bound of 70%, since a higher value is more appropriate when one or two PCs represent very dominant and rather obvious sources of variation.
diabetes_eigVal <- get_eigenvalue(diabetes_pca)
diabetes_eigVal[, -1]
## variance.percent cumulative.variance.percent
## Dim.1 17.242012 17.24201
## Dim.2 8.114846 25.35686
## Dim.3 6.361974 31.71883
## Dim.4 5.773092 37.49192
## Dim.5 5.609734 43.10166
## Dim.6 5.124052 48.22571
## Dim.7 4.984543 53.21025
## Dim.8 4.507156 57.71741
## Dim.9 4.421320 62.13873
## Dim.10 4.323462 66.46219
## Dim.11 3.859110 70.32130
## Dim.12 3.823562 74.14486
## Dim.13 3.582719 77.72758
## Dim.14 3.442238 81.16982
## Dim.15 3.339630 84.50945
## Dim.16 3.309812 87.81926
## Dim.17 3.074794 90.89406
## Dim.18 2.641750 93.53581
## Dim.19 2.397762 95.93357
## Dim.20 2.201971 98.13554
## Dim.21 1.864460 100.00000
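The same cutoff can be located programmatically (a small sketch using the cumulative column of the eigenvalue table):
# first PC at which cumulative variance reaches the 70% threshold
unname(which(diabetes_eigVal[, "cumulative.variance.percent"] >= 70)[1])
## [1] 11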
Here, the first 11 components account for just over 70% of the variation, indicating we should choose 11 PCs.
Kaiser's rule retains only those PCs whose eigenvalue exceeds 1. The rule is based on the idea that a PC with an eigenvalue below 1 contains less information than one of the original (standardized) variables and is not worth retaining.
diabetes_eigVal[, 1]
## [1] 3.6208226 1.7041177 1.3360146 1.2123493 1.1780441 1.0760509 1.0467541
## [8] 0.9465028 0.9284771 0.9079270 0.8104131 0.8029481 0.7523710 0.7228700
## [15] 0.7013223 0.6950606 0.6457068 0.5547674 0.5035301 0.4624138 0.3915365
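Applied to the eigenvalues above, the rule can be checked in one line:
# count the PCs whose eigenvalue exceeds 1 (Kaiser's rule)
sum(diabetes_eigVal[, 1] > 1)
## [1] 7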
scree.plot(diabetes_eigVal[,1], title = "EigenValue", type = "E", simu = "P")
Following Kaiser's rule, we would retain seven (7) PCs.
The higher the component loadings, the more important that variable is to the component.
fviz_pca_var(diabetes_pca, col.var = "contrib", axes = c(1, 2)) +
  labs(title = "PC1 and PC2") +
  scale_color_gradient2(low = "#77AA55", mid = "#BB0066",
                        high = "black", midpoint = 5)
fviz_pca_var(diabetes_pca, col.var = "contrib", axes = c(2, 3)) +
  labs(title = "PC2 and PC3") +
  scale_color_gradient2(low = "#77AA55", mid = "#BB0066",
                        high = "black", midpoint = 5)
fviz_pca_biplot(diabetes_pca, col.ind = "goldenrod4", col.var = "black")
fviz_pca_ind(diabetes_pca, col.ind="cos2", geom = "point", gradient.cols = c("green", "yellow", "red"))
PC_1_Contrib <- fviz_contrib(diabetes_pca, "var", axes=1, fill = "goldenrod4",color = "#77AA55")
PC_2_Contrib <- fviz_contrib(diabetes_pca, "var", axes=2, fill = "goldenrod4",color = "#77AA55")
PC_3_Contrib <- fviz_contrib(diabetes_pca, "var", axes=3, fill = "goldenrod4",color = "#77AA55")
PC_4_Contrib <- fviz_contrib(diabetes_pca, "var", axes=4, fill = "goldenrod4",color = "#77AA55")
grid.arrange(PC_1_Contrib,PC_2_Contrib,PC_3_Contrib,PC_4_Contrib)
PC_56_Contrib <- fviz_contrib(diabetes_pca, "var", axes=5:6, fill = "goldenrod4",color = "#77AA55")
PC_78_Contrib <- fviz_contrib(diabetes_pca, "var", axes=7:8, fill = "goldenrod4",color = "#77AA55")
PC_910_Contrib <- fviz_contrib(diabetes_pca, "var", axes=9:10, fill = "goldenrod4",color = "#77AA55")
PC_11_Contrib <- fviz_contrib(diabetes_pca, "var", axes=11, fill = "goldenrod4",color = "#77AA55")
grid.arrange(PC_56_Contrib,PC_78_Contrib,PC_910_Contrib,PC_11_Contrib)
GenHlth is the most important variable in PC1, while Age, Fruits, BMI, Sex, AnyHealthcare, HvyAlcoholConsump, Stroke, CholCheck, and Smoker dominate PC2 through PC11, respectively. Other important variables are HighChol and Education.
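These rankings can also be extracted programmatically (a sketch using factoextra's get_pca_var(); exact results depend on the random subsample):
# top-contributing variable on each of the first 11 PCs
var_contrib <- get_pca_var(diabetes_pca)$contrib
apply(var_contrib[, 1:11], 2, function(x) names(which.max(x)))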
diabetes_pca_r <- principal(diabetes_df_norm, nfactors = 11, rotate = "varimax")
A perfect simple-structure solution has a complexity of 1, meaning each item loads on only one factor. Uniqueness is the proportion of a variable's variance that is not shared with the other variables. In PCA we want uniqueness to be low, because then it is easier to reduce the space to a smaller number of dimensions. A uniqueness of 0.20 means that 20% of that variable's variance is not shared with the other variables in the overall factor model.
plot(diabetes_pca_r$complexity, diabetes_pca_r$uniqueness, xlab = "Complexity", ylab = "Uniqueness")
pointLabel(diabetes_pca_r$complexity, diabetes_pca_r$uniqueness, labels = names(diabetes_pca_r$uniqueness), cex = 0.8)
abline(h = 0.4, lty = 3, col = 2)  # uniqueness cutoff
abline(v = 1.9, lty = 3, col = 2)  # complexity cutoff
The variables in the upper-right region (complexity > 1.9 and uniqueness > 0.4) might be problematic for the analysis and should therefore be excluded.
diabetes_va<-data.frame(complex=diabetes_pca_r$complexity, unique=diabetes_pca_r$uniqueness)
diabetes_va.worst<-diabetes_va[diabetes_va$complex>1.9 & diabetes_va$unique>0.4,]
diabetes_va.worst
## complex unique
## HeartDiseaseorAttack 2.167463 0.4734780
## DiffWalk 3.177633 0.4321536
# Distances between variables (columns): transposing puts one variable per row.
# Note the data are unstandardized here, so a wide-range variable such as BMI
# dominates the Euclidean distances (visible in the matrix below).
diabetes_dist <- dist(t(diabetes_df))
as.matrix(diabetes_dist)[1:6, 1:6]
## HighBP HighChol CholCheck BMI Smoker Stroke
## HighBP 0.00000 49.10193 55.50676 2525.932 57.00877 60.53098
## HighChol 49.10193 0.00000 57.13143 2529.282 56.89464 59.95832
## CholCheck 55.50676 57.13143 0.00000 2493.432 61.00000 80.51708
## BMI 2525.93230 2529.28152 2493.43197 0.000 2535.02229 2568.28231
## Smoker 57.00877 56.89464 61.00000 2535.022 0.00000 57.11392
## Stroke 60.53098 59.95832 80.51708 2568.282 57.11392 0.00000
diabetes_mds <- cmdscale(diabetes_dist, k=2)
summary(diabetes_mds)
## V1 V2
## Min. :-272.95 Min. :-135.13
## 1st Qu.:-239.73 1st Qu.: -54.50
## Median :-221.04 Median : -48.26
## Mean : 0.00 Mean : 0.00
## 3rd Qu.: 58.75 3rd Qu.: -43.17
## Max. :2294.53 Max. : 736.38
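To gauge how faithfully two dimensions represent the original distances (a quick sketch), cmdscale() can also return its eigenvalues and a goodness-of-fit statistic:
# re-run classical MDS, requesting eigenvalues and goodness of fit
diabetes_mds_fit <- cmdscale(diabetes_dist, k = 2, eig = TRUE)
diabetes_mds_fit$GOF  # proportion of the distance structure captured in 2-D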
plot(diabetes_mds)
abline(v=c(-500, 500), lty=3, col=2)
abline(h=c(-200, 200), lty=3, col=2)
x.outlier <- which(diabetes_mds[, 1] > 500 | diabetes_mds[, 1] < -500)  # extreme on axis 1
y.outlier <- which(diabetes_mds[, 2] > 200 | diabetes_mds[, 2] < -200)  # extreme on axis 2
outlier.all <- c(x.outlier, y.outlier)
outlier_uni <- unique(outlier.all)  # drop variables flagged on both axes
plot(diabetes_mds)
abline(v=c(-500, 500), lty=3, col=2)
abline(h=c(-200, 200), lty=3, col=2)
points(diabetes_mds[outlier_uni,], pch=21, bg="red", cex=2)
text(diabetes_mds[outlier_uni,]-5, labels=rownames(diabetes_mds[outlier_uni,]))
This article set out to reduce the dimension of the Diabetes Health Indicators Dataset, which has 21 features, using Principal Component Analysis (PCA) and Multidimensional Scaling (MDS).
The analysis shows that the dataset can be represented by 11 PCs, which account for just over 70% of the variability in the dataset. The varimax-rotated solution also flagged two variables (HeartDiseaseorAttack and DiffWalk) for possible exclusion based on their high complexity and uniqueness. Finally, MDS was used to visualize similarities and dissimilarities between the variables based on the Euclidean distance metric.
Teboul, A. (2021). Diabetes Health Indicators Dataset. Kaggle. https://www.kaggle.com/alexteboul/diabetes-health-indicators-dataset?select=diabetes_binary_5050split_health_indicators_BRFSS2015.csv
https://rpubs.com/williamsurles/310847
Jolliffe, I. T. (2002). Principal Component Analysis (2nd ed.). Springer. http://cda.psych.uiuc.edu/statistical_learning_course/Jolliffe%20I.%20Principal%20Component%20Analysis%20(2ed.,%20Springer,%202002)(518s)_MVsa_.pdf
Hartmann, K., Krois, J., & Waske, B. (2018). E-Learning Project SOGA: Statistics and Geospatial Data Analysis. Department of Earth Sciences, Freie Universitaet Berlin. https://www.geo.fu-berlin.de/en/v/soga/Geodata-analysis/Principal-Component-Analysis/principal-components-basics/Reconstruction-of-the-original-data/index.html
https://easystats.github.io/parameters/reference/principal_components.html
https://www.displayr.com/what-is-multidimensional-scaling-mds/