Dimension Reduction methods

Introduction

Dimension reduction is a process of transforming data from high-dimensional to low-dimensional space, while retaining as much information as possible. There are multiple dimensionality reduction techniques, with one of the most popular being Principal Component Analysis (PCA).

This paper aims so apply PCA to a dataset about diabetes, downloaded form Kaggle.

Dataset

This dataset is a collection of patient records, selected from a larger database. All patients in the dataset are women over 21 years old of Pima Indian heritage.

df <- read.csv("diabetes.csv")
head(df)

##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0

str(df)

## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

Let’s take a closer look at the features:

* Pregnancies - number of times pregnant
* Glucose - Plasma glucose concentration
* Blood Pressure - in mm Hg
* SkinThickness - triceps skin fold thickness in mm
* Insulin - 2-hour serum insulin in mu U/ml
* BMI - body mass index
* DiabetesPedigreeFunction - scores the probability of diabetes based on family history
* Age - in years
* Outcome - binary variable: 0 - no diabetes, 1 - diabetes diagnosis

Cleaning data

All variables in this dataset are numerical, which means we don’t have to transform them. In order to be able to perform dimension reduction on the dataset it is important to first check, if there are any missing values.

sum(is.na(df))

## [1] 0

Numerical data

Using summary() funtion in R, we can easily get information about most important statistics on numerical variables in the dataset.

summary(df)

##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000

Age ranges from 21 to 81, with the average being just over 33 years old. Average women has been pregnant 3.845 times. Glucose level is on average around 121, but for most women it falls between 99 and 140. Mean blood pressure is equal to 69.11 mm Hg. The thickness of skin fold ranges from 0 to 99 mm, with the average equal to 20.54. Average insulin dose is 79.8 mu U/ml. BMI shows a higher concentration in the 30s range, as the 3rd quartile is at 36.6. Average diabetes pedigree function score is 0.4719.

library(ggplot2)
library(reshape2)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

num_plot <- df %>%
  select(-Outcome) %>% 
  melt()

## No id variables; using all as measure variables

options(repr.plot.width=12, repr.plot.height=20)

ggplot(num_plot, aes(x = value, fill = variable)) +
  geom_histogram(bins = 30, alpha = 0.7, show.legend = FALSE) +
  facet_wrap(~ variable, scales = "free", nrow = 3) +
  theme_minimal()

Histograms show the distribution of the numerical variables from the dataset. Pregnancy distribution is right skewed, just like DiabetesPedigreeFuntion and Age. Glucose seems normally distributed, but slightly right-skewed. The case for Blood Pressure is similar, with a peak around normal blood pressure levels. Skin thickness and Insulin also appear right-skewed, with most values on the lower end. BMI has a somewhat normal distribution, with fewer high values.

Just by looking at the histograms of numerical variables it may be hard to decide if data is normally distributed. However this will be relevant later on, when we will want to check correlation between variables.

Bar chart below shows the number of people with (1) and without (0) diabetes. Majority of women in the dataset, 500, are healthy, while around 270 have diabetes.

ggplot(df, aes(x = factor(Outcome))) + 
  geom_bar(fill = "steelblue") + 
  labs(title = "Diabetes Outcome Distribution", x = "Diabetes Diagnosis", y = "Count") +
  theme_minimal()

Correlation

A necessary step before checking correlation is to verify variable distribution, which will help us choose a proper measure. Generally Pearsons correlation is used for normally distributed data and Spearmans or Kendalls are used otherwise, depending on how large the dataset is.

pval <- sapply(df, function(x) shapiro.test(x)$p.value)
bad <- sum(pval < 0.05)

if (bad > 0) {
  cat("Variables not normally distributed:", bad, "\n")
} else {
  print(TRUE)
}

## Variables not normally distributed: 9

Test results show us that none of the numerical variables are normally distributed. This means we can’t use Pearsons correlation and as the dataset is small, we should use Kendalls Tau correlation.

library(corrplot)

## Warning: pakiet 'corrplot' został zbudowany w wersji R 4.4.2

## corrplot 0.95 loaded

corr<-cor(df, method = "kendall")
print(corr)

##                           Pregnancies    Glucose BloodPressure SkinThickness
## Pregnancies               1.000000000 0.09132309   0.135439668   -0.06440103
## Glucose                   0.091323093 1.00000000   0.159960649    0.03904583
## BloodPressure             0.135439668 0.15996065   1.000000000    0.09486767
## SkinThickness            -0.064401027 0.03904583   0.094867668    1.00000000
## Insulin                  -0.096416854 0.16364459  -0.003682299    0.42006645
## BMI                       0.004183211 0.15586214   0.205222464    0.33153190
## DiabetesPedigreeFunction -0.029958996 0.06187092   0.019448359    0.12645709
## Age                       0.458272007 0.19651008   0.246055669   -0.04475425
## Outcome                   0.170370465 0.39056483   0.119206140    0.07629714
##                               Insulin         BMI DiabetesPedigreeFunction
## Pregnancies              -0.096416854 0.004183211              -0.02995900
## Glucose                   0.163644593 0.155862144               0.06187092
## BloodPressure            -0.003682299 0.205222464               0.01944836
## SkinThickness             0.420066448 0.331531901               0.12645709
## Insulin                   1.000000000 0.141587453               0.16165188
## BMI                       0.141587453 1.000000000               0.09464399
## DiabetesPedigreeFunction  0.161651884 0.094643994               1.00000000
## Age                      -0.080175601 0.088678453               0.02804228
## Outcome                   0.058531153 0.253676046               0.14335873
##                                  Age    Outcome
## Pregnancies               0.45827201 0.17037047
## Glucose                   0.19651008 0.39056483
## BloodPressure             0.24605567 0.11920614
## SkinThickness            -0.04475425 0.07629714
## Insulin                  -0.08017560 0.05853115
## BMI                       0.08867845 0.25367605
## DiabetesPedigreeFunction  0.02804228 0.14335873
## Age                       1.00000000 0.25736317
## Outcome                   0.25736317 1.00000000

corrplot(corr)

The correlation matrix shows a couple of interesting correlations:

* Age & Pregnancies: 0.46 - older women tend to have been pregnant more times, which is logical
* Glucose & Outcome: 0.39 - higher glucose levels are associated with diabetes, so this finding amkes sense
* Age & Outcome: 0.26 - the older a person gets, the more likley they are to develop diabetes
* BMI & Outcome: 0.25 - higher BMI increases risk of diabetes
* Insulin & Outcome: 0.06 - a suprisingly low value
* DiabetesPedigreeFunction & Outcome: 0.14 - family history and predisposition has some effect, bit it's not dominant

There are many conclusions that could be drawn from the correlation matrix. Most importantly, glucose is strongly related to diabetes diagnosis, while variables like Age, BMI and Pregnancies have moderate effects.

PCA

df_pca <- df %>% select(-Outcome)

After exploring relationships between variables, we can move on to actual dimension reduction: PCA.

library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

pca <- prcomp(df_pca, center=TRUE, scale=TRUE)
summary(pca)

## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.4472 1.3158 1.0147 0.9357 0.87312 0.82621 0.64793
## Proportion of Variance 0.2618 0.2164 0.1287 0.1094 0.09529 0.08533 0.05248
## Cumulative Proportion  0.2618 0.4782 0.6069 0.7163 0.81164 0.89697 0.94944
##                            PC8
## Standard deviation     0.63597
## Proportion of Variance 0.05056
## Cumulative Proportion  1.00000

PCA results above show how much variance each component captures from the original dataset. PC1 captures only 26.18% of the variance and we capture 60% of the variance by PC3.

fviz_eig(pca)

A scree plot visualizes the importance of each component. From the one above we can see, that there’s a “drop” were variance decreases significantly, suggesting which components we should choose. In this case PC1 and PC2 are definitely significant. The biggest drop happens between PC2 and PC3. Thus retaining the first two or three PCs would be most informative.

fviz_pca_var(pca, col.var = "steelblue")

PCA biplot above visualizes how variables contribute to PCs. Each arrow represents a variable: the direction and length of the arrows tells us how strongly that variable influences the PCs. Longer arrows mean stronger contributions, while closeness of arrows to each other indicate if variables are correlated or not.

In this case for example Pregnancies and Age are correlated, just like BloodPressure and Glucose, which we have established earlier.

We can also check how variables contribute to different PCs. First four PCs explain over 70% of the variance.

library(pdp)

## Warning: pakiet 'pdp' został zbudowany w wersji R 4.4.2

## 
## Dołączanie pakietu: 'pdp'

## Następujący obiekt został zakryty z 'package:purrr':
## 
##     partial

library(gridExtra)

## 
## Dołączanie pakietu: 'gridExtra'

## Następujący obiekt został zakryty z 'package:dplyr':
## 
##     combine

PC1 <- fviz_contrib(pca, choice = "var", axes=1)
PC2 <- fviz_contrib(pca, choice = "var", axes=2)
PC3 <- fviz_contrib(pca, choice = "var", axes=3)
PC4 <- fviz_contrib(pca, choice = "var", axes=4)

grid.arrange(PC1, PC2, PC3, PC4)

Each plot represents how each variable contributes to a specific PC. BMI, SkinThickness and Insulin are the highest contributors to PC1. Age and Pregnancies explain most variance in PC2. In PC3 BloodPressure, Glucose and family history of diabetes are biggest contributors in PC3. Family history is significantly more important than other variables in PC4.

pca_scores <- as.data.frame(pca$x)
pca_scores$Outcome <- factor(df$Outcome, levels = c(0,1), labels = c("Healthy", "Diabetes"))

ggplot(pca_scores, aes(x = PC1, y = PC2, color = Outcome)) +
  geom_point(alpha = 0.7) +
  labs(title = "PCA Scatter Plot") +
  scale_color_manual(values = c("steelblue", "red")) + 
  theme_minimal()

The scatter plot above visualizes the relationship between first two components. PC1 and PC2 together explain over 47% of the variance. Each point represents an individual patient. Data points on the plot are colored based on whether or not the patient suffers from diabetes.

Conclusions

This papers aim was to explore dimension reduction techniques, specifically Principal Components Analysis. The goal of this kind of algorithm is to reduce the dimensions of the data, while retaining as much variance as possible. This makes it easier to visualize and analyze the data.

PCA was applied to a diabetes dataset. This method proved to be effective, as first four components captured over 70% of total variance. Variables like BMI, SkinThickness and Age proved to be most important in contributing to the first PCS. This might likely reflect overall health and genetic factors that influence diabetes.

Dimension Reduction

Zuzanna Herniczek

2025-02-09