Weekly Assignment #4


Conceptual questions

  1. With respect to hierarchical clustering, what are the concepts of linkage and distance metrics?

Hierarchical clustering is based on two main ideas: distance metrics and linkage methods. Distance metrics are mathematical ways to measure how similar or different data points are. Common examples include Euclidean distance, which measures straight-line distance, and Manhattan distance, which adds up the absolute differences between coordinates. These distances are used to create a distance matrix that compares all pairs of data points. Once these distances are known, linkage methods decide how clusters should be joined together. Single linkage uses the shortest distance between two clusters, complete linkage uses the longest, average linkage takes the average of all distances, and centroid linkage measures the distance between cluster centers. Together, these methods determine the shape of a dendrogram, a tree-like diagram that shows how clusters form and how similar they are to each other.
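
As a small illustration (a sketch on the built-in mtcars data, not the assignment data), the distance metric is chosen in dist() and the linkage method in hclust():

# Sketch with built-in data: distance metric and linkage method are chosen separately
dEuc <- dist(scale(mtcars), method = "euclidean")   # straight-line distance
dMan <- dist(scale(mtcars), method = "manhattan")   # sum of absolute differences
hcComplete <- hclust(dEuc, method = "complete")     # join by longest between-cluster distance
hcWard     <- hclust(dMan, method = "ward.D")       # minimise within-cluster variance
plot(hcComplete)                                    # dendrogram of the merge order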

  2. How are the principal components different from the original variables of a data set?

Principal components, created through Principal Component Analysis (PCA), are different from the original variables in a dataset. The original variables are the raw features, such as age, blood pressure, or gene expression levels. In contrast, principal components are new variables that summarize the most important patterns in the data. Each principal component is made by combining the original variables in a way that captures as much variation as possible. The first principal component explains the most variation, and each one after that explains a little less. Unlike the original variables, principal components are uncorrelated with each other and help reduce the number of variables while keeping most of the important information in the data.
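
For example (again a sketch on built-in data), prcomp() returns components that are weighted combinations of the original, scaled variables and are uncorrelated with one another:

# Sketch: principal components are uncorrelated combinations of the original variables
pcaToy <- prcomp(mtcars, scale. = TRUE)
pcaToy$rotation[, 1:2]           # weights that build PC1 and PC2 from the originals
round(cor(pcaToy$x[, 1:3]), 2)   # off-diagonal correlations are ~0
summary(pcaToy)                  # variance explained decreases from PC1 onward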

  3. What is a similarity and a difference between PCA and UMAP?

Both PCA and UMAP (Uniform Manifold Approximation and Projection) are dimensionality reduction methods used to simplify complex data while keeping important relationships between data points. They help make high-dimensional data easier to visualize and understand by showing it in fewer dimensions. However, they work in different ways. PCA is a linear method that finds directions (called principal components) that capture the most variation in the data, assuming the relationships between variables are linear. UMAP, on the other hand, is a non-linear method that focuses on preserving the local neighbourhood structure of the data while also retaining much of its broader organization. This allows it to capture more complex, non-linear patterns. Essentially, PCA is simpler and easier to interpret, while UMAP can reveal more detailed patterns but is less transparent and requires more computation.
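
A minimal sketch of the practical difference (a toy example assuming the umap package is installed):

# Sketch: both methods reduce the same scaled data to two dimensions, but PCA
# uses a linear projection while UMAP learns a non-linear embedding
library(umap)
X   <- scale(mtcars)
pc  <- prcomp(X)$x[, 1:2]   # linear: scores on the first two principal components
emb <- umap(X)$layout       # non-linear: 2-D UMAP layout (default settings)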

Coding questions

For the assignment we will return to the diabetes data set.

Loading the data

I will load the diabetes data set as db and the required libraries.

db<-read.csv("diabetes.csv")

# Load Libraries 
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(cowplot)
## 
## Attaching package: 'cowplot'
## 
## The following object is masked from 'package:lubridate':
## 
##     stamp
library(pheatmap)
library(formattable)
library(corrplot)
## corrplot 0.95 loaded
library(mclust)
## Package 'mclust' version 6.1.1
## Type 'citation("mclust")' for citing this R package in publications.
## 
## Attaching package: 'mclust'
## 
## The following object is masked from 'package:purrr':
## 
##     map
library(Rtsne)
library(umap)
library(GGally)

In this step, I will clean the dataset by removing biologically impossible values, specifically zeros in key variables such as Glucose, BloodPressure, SkinThickness, Insulin, and BMI. I will then add a new column called PatientID to uniquely identify each participant, numbering the rows sequentially from 1 onward. This column will be positioned as the first column in the dataset for easier reference. Finally, I will convert the Outcome variable from numeric values (0 and 1) to descriptive factor levels, labeling 0 as “Healthy” and 1 as “Sick,” which will facilitate interpretation in subsequent analyses such as clustering and PCA. I will also reuse code from Week 2 to add a new column with the pregnancy categories “no pregnancy”, “a few (1-3)”, “several (4-6)”, and “many (7+)”.

#many 0 values present that are not biologically possible
summary(db)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000
# Clean the data set
dbClean <- db %>% 
    # keep only rows where the variables Glucose through BMI
    # are all greater than 0
    # (~ defines an anonymous function; . stands in for each column being checked)
    filter_at(vars(Glucose:BMI), ~ . > 0) %>%
   # Add PatientID column (1 to number of rows)
  mutate(PatientID = 1:n()) %>%
  # Move PatientID to the first column
  relocate(PatientID, .before = 1)%>%  
  mutate(Outcome = factor(Outcome, levels = c(0, 1), labels = c("Healthy", "Sick")))%>%
  mutate(pregnancy_cat = cut(Pregnancies,breaks = c(-Inf, 0, 3, 6, Inf),labels = c("no pregnancy", "a few (1-3)", "several (4-6)", "many (7+)"), right = TRUE))


head(dbClean)
##   PatientID Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1         1           1      89            66            23      94 28.1
## 2         2           0     137            40            35     168 43.1
## 3         3           3      78            50            32      88 31.0
## 4         4           2     197            70            45     543 30.5
## 5         5           1     189            60            23     846 30.1
## 6         6           5     166            72            19     175 25.8
##   DiabetesPedigreeFunction Age Outcome pregnancy_cat
## 1                    0.167  21 Healthy   a few (1-3)
## 2                    2.288  33    Sick  no pregnancy
## 3                    0.248  26    Sick   a few (1-3)
## 4                    0.158  53    Sick   a few (1-3)
## 5                    0.398  59    Sick   a few (1-3)
## 6                    0.587  51    Sick several (4-6)
summary(dbClean)
##    PatientID       Pregnancies        Glucose      BloodPressure   
##  Min.   :  1.00   Min.   : 0.000   Min.   : 56.0   Min.   : 24.00  
##  1st Qu.: 98.75   1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00  
##  Median :196.50   Median : 2.000   Median :119.0   Median : 70.00  
##  Mean   :196.50   Mean   : 3.301   Mean   :122.6   Mean   : 70.66  
##  3rd Qu.:294.25   3rd Qu.: 5.000   3rd Qu.:143.0   3rd Qu.: 78.00  
##  Max.   :392.00   Max.   :17.000   Max.   :198.0   Max.   :110.00  
##  SkinThickness      Insulin            BMI        DiabetesPedigreeFunction
##  Min.   : 7.00   Min.   : 14.00   Min.   :18.20   Min.   :0.0850          
##  1st Qu.:21.00   1st Qu.: 76.75   1st Qu.:28.40   1st Qu.:0.2697          
##  Median :29.00   Median :125.50   Median :33.20   Median :0.4495          
##  Mean   :29.15   Mean   :156.06   Mean   :33.09   Mean   :0.5230          
##  3rd Qu.:37.00   3rd Qu.:190.00   3rd Qu.:37.10   3rd Qu.:0.6870          
##  Max.   :63.00   Max.   :846.00   Max.   :67.10   Max.   :2.4200          
##       Age           Outcome          pregnancy_cat
##  Min.   :21.00   Healthy:262   no pregnancy : 56  
##  1st Qu.:23.00   Sick   :130   a few (1-3)  :202  
##  Median :27.00                 several (4-6): 67  
##  Mean   :30.86                 many (7+)    : 67  
##  3rd Qu.:36.00                                    
##  Max.   :81.00

The code successfully filtered out rows with biologically impossible zero values in the selected variables, leaving only valid data for analysis. The PatientID column was added and positioned as the first column, providing a clear identifier for each patient. The Outcome variable was converted into a factor with levels “Healthy” and “Sick,” making the data more interpretable. The pregnancy_cat column was added to the right of the table. Summary statistics show that all selected variables now contain valid, positive values, and the dataset is ready for downstream analyses such as heatmaps, clustering, PCA, and UMAP projections.
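
For reference, the number of rows removed by the zero-value filter could be counted directly; a one-line sketch:

# Rows dropped when filtering out biologically impossible zero values
nrow(db) - nrow(dbClean)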

For the following questions, I will keep Outcome, Pregnancies, and DiabetesPedigreeFunction as out-of-model variables. That means they will not be used for clustering or for calculating the PCA. We will use these three variables as annotation columns and to test clusters and PCs to determine if they support the data exploration.

  1. Create a scaled data set.

I will create a scaled version of the dataset using the scale() function to prepare it for clustering and principal component analysis (PCA). Scaling ensures that all variables contribute equally to the analysis, regardless of their original measurement units. I will standardize only the numeric variables relevant to modeling while excluding Outcome, Pregnancies, and DiabetesPedigreeFunction, as these are treated as out-of-model variables. They will instead serve as annotation columns for later exploratory analyses and to help interpret clustering and PCA results. To do this, I will create a dataScaled object from the cleaned dataset (dbClean), move PatientID into the row names so that only the ID numbers identify each row, and select the numeric variables from Glucose to Age, explicitly excluding DiabetesPedigreeFunction since it falls within that range. Finally, I will use the head() function to inspect the first few rows and confirm that the scaling was applied correctly.

# Scaled data set for clustering and PCA
dataScaled <- dbClean %>%
  column_to_rownames("PatientID")%>%
  # select only the numeric variables to scale, excluding Outcome, Pregnancies, DiabetesPedigreeFunction
  select(Glucose:Age) %>% 
  select(-DiabetesPedigreeFunction)%>%
  scale()

# Check the first few rows
head(dataScaled)
##      Glucose BloodPressure SkinThickness    Insulin        BMI        Age
## 1 -1.0896533   -0.37317791    -0.5843629 -0.5221747 -0.7095143 -0.9670632
## 2  0.4657189   -2.45382847     0.5567094  0.1005024  1.4249091  0.2093178
## 3 -1.4460927   -1.65357826     0.2714413 -0.5726620 -0.2968591 -0.4769045
## 4  2.4099341   -0.05307782     1.5076030  3.2559608 -0.3680065  2.1699528
## 5  2.1507054   -0.85332804    -0.5843629  5.8055711 -0.4249245  2.7581434
## 6  1.4054229    0.10697222    -0.9647204  0.1594043 -1.0367925  1.9738893

The code successfully generated a scaled dataset named dataScaled, containing only the numeric variables used for clustering and PCA. Each variable now has a mean of 0 and a standard deviation of 1, allowing for unbiased comparisons between variables of different units, such as glucose concentration and body mass index. By examining the first few rows, we can confirm that the transformation worked as intended, with standardized values centered around zero. This cleaned and normalized dataset is now ready to be used for downstream analyses such as clustering and dimensionality reduction.
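
As a quick sanity check (not shown in the assignment output), the centering and scaling could be verified directly:

# Column means should be approximately 0 and standard deviations 1 after scale()
round(colMeans(dataScaled), 10)
apply(dataScaled, 2, sd)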

  2. Generate four heat maps to explore Manhattan and Euclidean distance metrics using complete and ward.D linkages. Consider that the data set and the annotation table both need to have the same rownames. Are there patient IDs that could be used?

In this step, I will generate four heat maps to visually explore how different combinations of distance metrics and clustering methods affect the grouping of the data. Specifically, I will compare Manhattan and Euclidean distance metrics in combination with complete and ward.D linkage methods. These choices will help determine how sensitive the clustering results are to the method of measuring similarity and the way clusters are formed. To make the heat maps interpretable, I will create an annotation table using Outcome, Pregnancies, and DiabetesPedigreeFunction so that the patient-level metadata can be displayed alongside the clustered data. I will also ensure that the dataset and the annotation table share identical row names, which in this case correspond to PatientID, to avoid mismatches and errors during visualization. Finally, I will loop through all combinations of distance and linkage parameters, store each resulting heat map, and display them together using plot_grid() for easy comparison.

# Annotation table for heatmaps
anno<- dbClean %>%
  column_to_rownames("PatientID")%>%
  select(Outcome, Pregnancies, DiabetesPedigreeFunction)

# Test heat map
pheatmap(t(dataScaled),annotation_col=anno,show_colnames=F)

# Distance and linkage combinations
distances <- c("euclidean", "manhattan")
linkages <- c("complete", "ward.D")

# Empty list to store plots
pheatList <- list()

a <- 1

for(d in 1:length(distances)){
  for(l in 1:length(linkages))
    {
    pheatList[[a]] <- pheatmap(t(dataScaled),
                               annotation_col = anno,
                               show_colnames = F,
                               clustering_distance_cols = distances[d],
                               clustering_method = linkages[l],
                               main = paste(linkages[l], distances[d]),
                               silent = T,
                              fontsize = 4
                               )[[4]]
    a <- a + 1
  }
}

plot_grid(plotlist = pheatList)

The generated heat maps successfully displayed the clustering patterns of the scaled dataset under the four different combinations of distance metrics and linkage methods. Each map provided a slightly different representation of how patients group together based on their physiological attributes. Among the tested combinations, the Ward.D linkage combined with the Manhattan distance metric appeared to produce the most cohesive and well-balanced clusters, suggesting that this pairing best captures the natural structure of the data while maintaining robustness to outliers. In contrast, the Euclidean distance yielded slightly sharper but less stable transitions between clusters. The annotation columns (Outcome, Pregnancies, and DiabetesPedigreeFunction) aligned correctly with the corresponding samples, confirming that the row names were matched properly. Overall, the code functioned as intended and provided valuable insight into how the choice of clustering parameters influences the visual interpretation of patient data.

Across all four heatmaps — complete linkage with Euclidean distance (top left), Ward.D with Euclidean distance (top right), complete linkage with Manhattan distance (bottom left), and Ward.D with Manhattan distance (bottom right) — similar global clustering patterns emerged. This consistency indicates that the dataset’s underlying structure is robust, as the grouping of samples does not drastically change across metrics. The Euclidean-based heatmaps showed sharper transitions between clusters, while Manhattan distance produced smoother gradients, reflecting reduced sensitivity to outliers — a beneficial property for biological data. The Ward.D linkage method, which minimizes within-cluster variance, generated more compact and balanced clusters than complete linkage, which tended to yield broader, uneven groupings.

Variables such as Glucose, Insulin, and BMI demonstrated the strongest variation across samples, highlighting their importance as discriminating features within the dataset. In contrast, BloodPressure and SkinThickness appeared more uniform, contributing less to cluster separation. The annotation bars for Outcome and Pregnancies suggest that patient clustering may partially align with clinical outcomes, indicating potential biological relevance. Altogether, these results show that while all four methods reveal consistent clustering patterns, Ward.D combined with Manhattan distance provides the clearest and most balanced separation, making it the most suitable method for further downstream analyses.

  3. Select one of the combinations of distance and linkage:
    1. generate clusters

    2. test the clusters by chi-square to determine whether the clusters are dependent on the outcome or on the number of pregnancies in our previous categories of none, a few, several and many.

I will use the scaled dataset to formally generate clusters based on one of the previously explored combinations of distance metrics and linkage methods. I have chosen the Manhattan distance with the Ward.D linkage method, as this pairing tends to produce compact, well-balanced clusters while remaining relatively robust to outliers. The objective is to create hierarchical clusters and then statistically assess whether these clusters correspond to meaningful clinical or demographic distinctions in the dataset. Specifically, I will perform a chi-square test to determine whether the resulting clusters are significantly dependent on two annotation variables: Outcome (Healthy vs. Sick) and Pregnancy category (None, A few, Several, Many). A significant relationship would indicate that the clustering structure reflects underlying biological or clinical characteristics, validating the chosen approach.

To generate the clusters, I will cut the dendrogram tree to yield three distinct clusters (cutree(hcDbComb, k = 3)). This number of clusters is reasonable given the nature of the annotation variables: since there are two possible outcomes and four pregnancy categories, three clusters may capture the main patterns of overlap between these traits, for instance, distinguishing groups that are predominantly healthy, predominantly sick, or intermediate. In practice, it is often useful to test several cut points to determine the number of clusters that provides the most meaningful separation for the analysis at hand.

# Create a hierarchical cluster
hcDbComb<-hclust(dist(dataScaled,method= "manhattan"),method="ward.D")

# Cut the hierarchical cluster
clusters<-cutree(hcDbComb,k=3)


# Contingency table
table(clusters, paste(dbClean$Outcome, dbClean$pregnancy_cat,sep="_"))%>%
  chisq.test()
## Warning in chisq.test(.): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 151.69, df = 14, p-value < 2.2e-16

The code successfully generated three clusters using hierarchical clustering with the Manhattan distance and Ward.D linkage method. The resulting chi-square test yielded a highly significant result (X² = 151.69, df = 14, p < 2.2e-16), indicating that the clusters are not random but rather significantly associated with Outcome and Pregnancy categories. The warning that the chi-squared approximation may be incorrect likely reflects small expected counts in some cluster-by-category cells, so the exact p-value should be read with some caution, although the association is clearly strong. This suggests that the clustering algorithm captured meaningful biological patterns within the dataset. Patients within the same cluster tend to share similar physiological and clinical profiles, reinforcing that the chosen distance and linkage combination effectively distinguishes subgroups within the population. These findings demonstrate that hierarchical clustering, when paired with the appropriate parameters, can reveal clinically relevant groupings that correspond to disease outcome and pregnancy trends, supporting its utility in exploratory biomedical data analysis.
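
As noted above, the choice of k = 3 is somewhat arbitrary; a minimal sketch (not part of the assignment output) for comparing several cut points with the same test might look like this:

# Optional sketch: repeat the chi-square test for several cut points of the tree
for (k in 2:5) {
  cl  <- cutree(hcDbComb, k = k)
  tst <- suppressWarnings(
    chisq.test(table(cl, paste(dbClean$Outcome, dbClean$pregnancy_cat, sep = "_")))
  )
  cat("k =", k, " X-squared =", round(unname(tst$statistic), 1),
      " p =", signif(tst$p.value, 3), "\n")
}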

  4. Using the dbClean data set:
    1. Calculate a PCA of the data

    2. Graph the first 4 PCs and colour the graph to test if they explain the outcome, number of pregnancies (as a continuous variable), or the diabetes pedigree factor.

In this step, I will perform a Principal Component Analysis (PCA) on the cleaned dataset to reduce dimensionality and identify which variables contribute most to the overall variance. PCA will help uncover underlying structure in the data by transforming correlated variables into a smaller number of uncorrelated components. I will include all numeric predictors (from Glucose to Age, excluding DiabetesPedigreeFunction) and scale them to ensure that variables measured on different scales contribute equally to the analysis. Once the PCA is computed, I will visualize the first four principal components (PCs) using a pairwise plot. The points will be colored according to the Outcome variable (Healthy vs. Sick) to examine whether health status separates meaningfully along these components. This will help determine whether the PCA captures patterns associated with disease outcomes in the dataset.

# Selection of data and calculation of a PCA
pcaDbClean <- dbClean %>%
  # move PatientID into the row names
  column_to_rownames("PatientID") %>%
  # keep only the numeric model variables
  select(Glucose:Age, -DiabetesPedigreeFunction) %>%
  prcomp(scale. = TRUE)

# Inspect the components of the prcomp object
names(pcaDbClean)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
# Graph the first 4 PCs
pcaDbClean$x%>%
  data.frame()%>%
  select(1:4)%>%
  ggpairs(progress=FALSE,aes(colour=dbClean$Outcome,alpha=0.5))

The PCA successfully reduced the dataset’s variables into four main components, each representing a linear combination of the original features. The resulting pairwise plots of PC1 through PC4 illustrate the distribution of patients colored by health status (Healthy in red, Sick in blue). The overall visualization shows a stronger separation between healthy and sick individuals along PC1 and partially across PC2–PC4; however, the more substantial overlap between the two outcome groups in other component pairings suggests that the separation captured by PCA is limited and that additional factors beyond these linear combinations influence the outcome.

More specifically, PC1 and PC2 capture most of the variation, with PC1 showing a noticeable distributional shift between healthy and sick individuals that hints at some clustering structure. In contrast, PC3 and PC4 contribute minimally to group separation, failing to form distinct clusters or subclusters. These findings indicate that PCA effectively summarizes variability in the dataset, yet the key differences between healthy and sick individuals are likely non-linear or depend on complex interactions among variables rather than single axes of variation. This insight motivates further exploration using non-linear dimensionality reduction methods such as UMAP, which may better capture subtle, multidimensional patterns in the data.
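
To support the claim that PC1 and PC2 capture most of the variation, the proportion of variance explained by each component could be checked directly; a minimal sketch:

# Proportion of variance explained by each principal component
summary(pcaDbClean)$importance
# Equivalent quick check from the standard deviations
round(pcaDbClean$sdev^2 / sum(pcaDbClean$sdev^2), 3)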

  5. Create an optimized UMAP projection of the data and check how the variables outcome, number of pregnancies (as a continuous variable), or the diabetes pedigree factor separate on the UMAP.

I will generate optimized UMAP (Uniform Manifold Approximation and Projection) visualizations to further explore the structure of the diabetes dataset in two dimensions. Building on the PCA and clustering analyses performed earlier, UMAP will allow me to capture potential non-linear relationships among variables that PCA may have missed. By experimenting with different n_neighbors values (10 and 50), I aim to compare how local versus global data structures influence the projection and the visibility of underlying clusters. Specifically, I will examine how the Outcome (Healthy vs. Sick), Pregnancies (as a continuous variable), and DiabetesPedigreeFunction distribute across the reduced dimensions. This will help determine whether these variables show any natural separation or patterning in the manifold space, offering additional insight into how physiological, reproductive, and genetic risk factors relate to diabetic outcomes.

# Compute one UMAP layout per n_neighbors setting so that all plots for a
# given setting share the same embedding (UMAP is stochastic, so recomputing
# it for every plot would give each panel a different projection)
umap10 <- umap(dataScaled, n_neighbors = 10, metric = "manhattan", min_dist = 0.5)
umap50 <- umap(dataScaled, n_neighbors = 50, metric = "manhattan", min_dist = 0.5)

# UMAP coordinates as data frames
layout10 <- umap10$layout %>% data.frame() %>% rename(Dim1 = X1, Dim2 = X2)
layout50 <- umap50$layout %>% data.frame() %>% rename(Dim1 = X1, Dim2 = X2)

# Empty list to store plots
UMAPs <- list()

# 1. UMAP Outcome, n_neighbors = 10
UMAPs[[1]] <- layout10 %>%
  ggplot(aes(x = Dim1, y = Dim2, colour = dbClean$Outcome)) +
  geom_point() +
  labs(title = "n_neighbors = 10", color = "Outcome")

# 2. UMAP Outcome, n_neighbors = 50
UMAPs[[2]] <- layout50 %>%
  ggplot(aes(x = Dim1, y = Dim2, colour = dbClean$Outcome)) +
  geom_point() +
  labs(title = "n_neighbors = 50", color = "Outcome")

# 3. UMAP Pregnancies, n_neighbors = 10
UMAPs[[3]] <- layout10 %>%
  ggplot(aes(x = Dim1, y = Dim2, color = dbClean$Pregnancies)) +
  geom_point() +
  scale_colour_gradient(low = "blue", high = "red") +
  labs(title = "n_neighbors = 10", color = "Pregnancies")

# 4. UMAP Pregnancies, n_neighbors = 50
UMAPs[[4]] <- layout50 %>%
  ggplot(aes(x = Dim1, y = Dim2, color = dbClean$Pregnancies)) +
  geom_point() +
  scale_colour_gradient(low = "blue", high = "red") +
  labs(title = "n_neighbors = 50", color = "Pregnancies")

# 5. UMAP DiabetesPedigreeFunction, n_neighbors = 10
UMAPs[[5]] <- layout10 %>%
  ggplot(aes(x = Dim1, y = Dim2, color = dbClean$DiabetesPedigreeFunction)) +
  geom_point() +
  scale_colour_gradient(low = "blue", high = "red") +
  labs(title = "n_neighbors = 10", color = "DiabetesPedigree")

# 6. UMAP DiabetesPedigreeFunction, n_neighbors = 50
UMAPs[[6]] <- layout50 %>%
  ggplot(aes(x = Dim1, y = Dim2, color = dbClean$DiabetesPedigreeFunction)) +
  geom_point() +
  scale_colour_gradient(low = "blue", high = "red") +
  labs(title = "n_neighbors = 50", color = "DiabetesPedigree")

# Combine all six plots
plot_grid(plotlist = UMAPs, ncol = 2, align="hv")

The UMAP projections successfully reduced the high-dimensional diabetes dataset into a two-dimensional space, revealing patterns that complement the findings from PCA and clustering. When colored by Outcome, the plots suggest partial but not complete separation between Healthy and Sick individuals, indicating that while disease status influences the data structure, there is still substantial overlap—consistent with a multifactorial condition like diabetes. Increasing n_neighbors from 10 to 50 produced smoother, more globally coherent structures, suggesting that larger neighborhood sizes emphasize broader population trends rather than fine-grained local clusters. When visualized by Pregnancies, a gradient pattern emerged, with higher pregnancy counts forming subtle subclusters, implying a modest association between reproductive history and physiological traits captured by the model. Similarly, the DiabetesPedigreeFunction visualizations showed localized areas of higher familial risk, reinforcing that genetic predisposition contributes to certain regions of the data manifold. Overall, the UMAP results demonstrate that while variable separations are nuanced rather than absolute, they collectively highlight meaningful biological gradients in the dataset that extend beyond linear patterns identified through PCA.