Dimension reduction is a process of transforming data from high-dimensional to low-dimensional space, while retaining as much information as possible. There are multiple dimensionality reduction techniques, with one of the most popular being Principal Component Analysis (PCA).
This paper aims so apply PCA to a dataset about diabetes, downloaded form Kaggle.
This dataset is a collection of patient records, selected from a larger database. All patients in the dataset are women over 21 years old of Pima Indian heritage.
df <- read.csv("diabetes.csv")
head(df)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
str(df)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
Let’s take a closer look at the features:
* Pregnancies - number of times pregnant
* Glucose - Plasma glucose concentration
* Blood Pressure - in mm Hg
* SkinThickness - triceps skin fold thickness in mm
* Insulin - 2-hour serum insulin in mu U/ml
* BMI - body mass index
* DiabetesPedigreeFunction - scores the probability of diabetes based on family history
* Age - in years
* Outcome - binary variable: 0 - no diabetes, 1 - diabetes diagnosis
All variables in this dataset are numerical, which means we don’t have to transform them. In order to be able to perform dimension reduction on the dataset it is important to first check, if there are any missing values.
sum(is.na(df))
## [1] 0
Using summary() funtion in R, we can easily get information about most important statistics on numerical variables in the dataset.
summary(df)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
Age ranges from 21 to 81, with the average being just over 33 years old. Average women has been pregnant 3.845 times. Glucose level is on average around 121, but for most women it falls between 99 and 140. Mean blood pressure is equal to 69.11 mm Hg. The thickness of skin fold ranges from 0 to 99 mm, with the average equal to 20.54. Average insulin dose is 79.8 mu U/ml. BMI shows a higher concentration in the 30s range, as the 3rd quartile is at 36.6. Average diabetes pedigree function score is 0.4719.
library(ggplot2)
library(reshape2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
num_plot <- df %>%
select(-Outcome) %>%
melt()
## No id variables; using all as measure variables
options(repr.plot.width=12, repr.plot.height=20)
ggplot(num_plot, aes(x = value, fill = variable)) +
geom_histogram(bins = 30, alpha = 0.7, show.legend = FALSE) +
facet_wrap(~ variable, scales = "free", nrow = 3) +
theme_minimal()
Histograms show the distribution of the numerical variables from the dataset. Pregnancy distribution is right skewed, just like DiabetesPedigreeFuntion and Age. Glucose seems normally distributed, but slightly right-skewed. The case for Blood Pressure is similar, with a peak around normal blood pressure levels. Skin thickness and Insulin also appear right-skewed, with most values on the lower end. BMI has a somewhat normal distribution, with fewer high values.
Just by looking at the histograms of numerical variables it may be hard to decide if data is normally distributed. However this will be relevant later on, when we will want to check correlation between variables.
Bar chart below shows the number of people with (1) and without (0) diabetes. Majority of women in the dataset, 500, are healthy, while around 270 have diabetes.
ggplot(df, aes(x = factor(Outcome))) +
geom_bar(fill = "steelblue") +
labs(title = "Diabetes Outcome Distribution", x = "Diabetes Diagnosis", y = "Count") +
theme_minimal()
A necessary step before checking correlation is to verify variable distribution, which will help us choose a proper measure. Generally Pearsons correlation is used for normally distributed data and Spearmans or Kendalls are used otherwise, depending on how large the dataset is.
pval <- sapply(df, function(x) shapiro.test(x)$p.value)
bad <- sum(pval < 0.05)
if (bad > 0) {
cat("Variables not normally distributed:", bad, "\n")
} else {
print(TRUE)
}
## Variables not normally distributed: 9
Test results show us that none of the numerical variables are normally distributed. This means we can’t use Pearsons correlation and as the dataset is small, we should use Kendalls Tau correlation.
library(corrplot)
## Warning: pakiet 'corrplot' został zbudowany w wersji R 4.4.2
## corrplot 0.95 loaded
corr<-cor(df, method = "kendall")
print(corr)
## Pregnancies Glucose BloodPressure SkinThickness
## Pregnancies 1.000000000 0.09132309 0.135439668 -0.06440103
## Glucose 0.091323093 1.00000000 0.159960649 0.03904583
## BloodPressure 0.135439668 0.15996065 1.000000000 0.09486767
## SkinThickness -0.064401027 0.03904583 0.094867668 1.00000000
## Insulin -0.096416854 0.16364459 -0.003682299 0.42006645
## BMI 0.004183211 0.15586214 0.205222464 0.33153190
## DiabetesPedigreeFunction -0.029958996 0.06187092 0.019448359 0.12645709
## Age 0.458272007 0.19651008 0.246055669 -0.04475425
## Outcome 0.170370465 0.39056483 0.119206140 0.07629714
## Insulin BMI DiabetesPedigreeFunction
## Pregnancies -0.096416854 0.004183211 -0.02995900
## Glucose 0.163644593 0.155862144 0.06187092
## BloodPressure -0.003682299 0.205222464 0.01944836
## SkinThickness 0.420066448 0.331531901 0.12645709
## Insulin 1.000000000 0.141587453 0.16165188
## BMI 0.141587453 1.000000000 0.09464399
## DiabetesPedigreeFunction 0.161651884 0.094643994 1.00000000
## Age -0.080175601 0.088678453 0.02804228
## Outcome 0.058531153 0.253676046 0.14335873
## Age Outcome
## Pregnancies 0.45827201 0.17037047
## Glucose 0.19651008 0.39056483
## BloodPressure 0.24605567 0.11920614
## SkinThickness -0.04475425 0.07629714
## Insulin -0.08017560 0.05853115
## BMI 0.08867845 0.25367605
## DiabetesPedigreeFunction 0.02804228 0.14335873
## Age 1.00000000 0.25736317
## Outcome 0.25736317 1.00000000
corrplot(corr)
The correlation matrix shows a couple of interesting correlations:
* Age & Pregnancies: 0.46 - older women tend to have been pregnant more times, which is logical
* Glucose & Outcome: 0.39 - higher glucose levels are associated with diabetes, so this finding amkes sense
* Age & Outcome: 0.26 - the older a person gets, the more likley they are to develop diabetes
* BMI & Outcome: 0.25 - higher BMI increases risk of diabetes
* Insulin & Outcome: 0.06 - a suprisingly low value
* DiabetesPedigreeFunction & Outcome: 0.14 - family history and predisposition has some effect, bit it's not dominant
There are many conclusions that could be drawn from the correlation matrix. Most importantly, glucose is strongly related to diabetes diagnosis, while variables like Age, BMI and Pregnancies have moderate effects.
df_pca <- df %>% select(-Outcome)
After exploring relationships between variables, we can move on to actual dimension reduction: PCA.
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
pca <- prcomp(df_pca, center=TRUE, scale=TRUE)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.4472 1.3158 1.0147 0.9357 0.87312 0.82621 0.64793
## Proportion of Variance 0.2618 0.2164 0.1287 0.1094 0.09529 0.08533 0.05248
## Cumulative Proportion 0.2618 0.4782 0.6069 0.7163 0.81164 0.89697 0.94944
## PC8
## Standard deviation 0.63597
## Proportion of Variance 0.05056
## Cumulative Proportion 1.00000
PCA results above show how much variance each component captures from the original dataset. PC1 captures only 26.18% of the variance and we capture 60% of the variance by PC3.
fviz_eig(pca)
A scree plot visualizes the importance of each component. From the one above we can see, that there’s a “drop” were variance decreases significantly, suggesting which components we should choose. In this case PC1 and PC2 are definitely significant. The biggest drop happens between PC2 and PC3. Thus retaining the first two or three PCs would be most informative.
fviz_pca_var(pca, col.var = "steelblue")
PCA biplot above visualizes how variables contribute to PCs. Each arrow represents a variable: the direction and length of the arrows tells us how strongly that variable influences the PCs. Longer arrows mean stronger contributions, while closeness of arrows to each other indicate if variables are correlated or not.
In this case for example Pregnancies and Age are correlated, just like BloodPressure and Glucose, which we have established earlier.
We can also check how variables contribute to different PCs. First four PCs explain over 70% of the variance.
library(pdp)
## Warning: pakiet 'pdp' został zbudowany w wersji R 4.4.2
##
## Dołączanie pakietu: 'pdp'
## Następujący obiekt został zakryty z 'package:purrr':
##
## partial
library(gridExtra)
##
## Dołączanie pakietu: 'gridExtra'
## Następujący obiekt został zakryty z 'package:dplyr':
##
## combine
PC1 <- fviz_contrib(pca, choice = "var", axes=1)
PC2 <- fviz_contrib(pca, choice = "var", axes=2)
PC3 <- fviz_contrib(pca, choice = "var", axes=3)
PC4 <- fviz_contrib(pca, choice = "var", axes=4)
grid.arrange(PC1, PC2, PC3, PC4)
Each plot represents how each variable contributes to a specific PC. BMI, SkinThickness and Insulin are the highest contributors to PC1. Age and Pregnancies explain most variance in PC2. In PC3 BloodPressure, Glucose and family history of diabetes are biggest contributors in PC3. Family history is significantly more important than other variables in PC4.
pca_scores <- as.data.frame(pca$x)
pca_scores$Outcome <- factor(df$Outcome, levels = c(0,1), labels = c("Healthy", "Diabetes"))
ggplot(pca_scores, aes(x = PC1, y = PC2, color = Outcome)) +
geom_point(alpha = 0.7) +
labs(title = "PCA Scatter Plot") +
scale_color_manual(values = c("steelblue", "red")) +
theme_minimal()
The scatter plot above visualizes the relationship between first two components. PC1 and PC2 together explain over 47% of the variance. Each point represents an individual patient. Data points on the plot are colored based on whether or not the patient suffers from diabetes.
This papers aim was to explore dimension reduction techniques, specifically Principal Components Analysis. The goal of this kind of algorithm is to reduce the dimensions of the data, while retaining as much variance as possible. This makes it easier to visualize and analyze the data.
PCA was applied to a diabetes dataset. This method proved to be effective, as first four components captured over 70% of total variance. Variables like BMI, SkinThickness and Age proved to be most important in contributing to the first PCS. This might likely reflect overall health and genetic factors that influence diabetes.