Purpose

The purpose of this exercise is to practice clustering survey data using the k-means clustering algorithm. I will be using data from the National Health and Nutrition Examination Survey (NHANES), a major program of the National Center for Health Statistics (NCHS), which is part of the Centers for Disease Control and Prevention (CDC). Ultimately I want to use k-means to segment survey respondents into different groups based on their responses to NHANES’ depression screening questionnaire and see if I can isolate the respondents with moderate-to-severe depressive symptoms. By the end I hope to have successfully practiced applying k-means, tuning hyperparameters, and evaluating/interpreting the results.

Data

The foundational data used in this exercise comes from the NHANES program's 2017-March 2020 pre-pandemic cycle. After a left join of the depression screener data (P_DPQ) with selected columns of the demographic data (P_DEMO), the resulting dataset contains one row per respondent with the ten screener items and five demographic variables shown in the Import Data section below.

Links:

https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Demographics&Cycle=2017-2020

https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Questionnaire&Cycle=2017-2020

Import Data

library(haven)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
dep = read_xpt("/Users/bryanokani/Downloads/P_DPQ.xpt")
demo = read_xpt("/Users/bryanokani/Downloads/P_DEMO.xpt")

# keep SEQN plus RIAGENDR, RIDAGEYR, RIDRETH3, DMDBORN4, and INDFMPIR from the demographic file
df = merge(dep, demo[ , c(1,4,5,8,10,29)], by = "SEQN", all.x = TRUE)
head(df)
##     SEQN DPQ010 DPQ020 DPQ030 DPQ040 DPQ050 DPQ060 DPQ070 DPQ080 DPQ090 DPQ100
## 1 109266      0      0      0      0      0      0      0      0      0     NA
## 2 109271      2      1      0      0      0      0      2      0      0      0
## 3 109273      2      2      2      2      2      2      2      1      0      0
## 4 109274      0      0      0      0      0      0      0      0      0     NA
## 5 109282      0      1      0      1      0      0      0      3      0      0
## 6 109284      0      0      0      0      0      0      0      0      0     NA
##   RIAGENDR RIDAGEYR RIDRETH3 DMDBORN4 INDFMPIR
## 1        2       29        6        2     5.00
## 2        1       49        3        1       NA
## 3        1       36        3        1     0.83
## 4        1       68        7        1     1.20
## 5        1       76        3        1     3.61
## 6        2       44        1        2       NA

Preprocessing

Omit NAs

We currently have 8965 rows and 16 columns. Since we only imported selected columns of the demographic dataframe, we don't need to remove any more variables.

First, let’s remove all the observations that have missing data (NAs). This brings the total number of observations down to 4833.

df = na.omit(df)
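A quick dimension check confirms the counts quoted above (the exact numbers assume the same 2017-March 2020 files):

dim(df)   # expect 4833 rows and 16 columns after na.omit()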

Omit certain responses

Reading through the NHANES documentation, we can see that for each of the depression screener questions (and for country of birth), a certain number of interviewees either responded "Don't know" (coded 7, or 77 for DMDBORN4) or refused to respond (coded 9, or 99 for DMDBORN4). I decided to remove these observations from the data as well.

# drop "Don't know" and "Refused" responses: coded 7/9 for the screener items, 77/99 for country of birth
for (col in paste0("DPQ", sprintf("%03d", seq(10, 100, by = 10)))) {
  df = df[ !(df[[col]] %in% c(7, 9)) , ]
}
df = df[ !(df$DMDBORN4 %in% c(77, 99)) , ]

Factoring categorical demographic variables

The last step involves cleaning the categorical demographic columns. I want to convert these variables to factors and rename the levels, as well as combine the Race/Hispanic Origin categories "Mexican American" and "Other Hispanic" into a single "Hispanic" category for simplicity's sake.

#Gender
df$RIAGENDR = as.factor(df$RIAGENDR)
levels(df$RIAGENDR) = c("Male", "Female")

#Race (recode Mexican American, 1, to Other Hispanic, 2, so both become "Hispanic")
df$RIDRETH3[which(df$RIDRETH3 == 1)] <- 2
df$RIDRETH3 = as.factor(df$RIDRETH3)
levels(df$RIDRETH3) = c("Hispanic", "White", "Black", "Asian", "Other")

#Country of Birth
df$DMDBORN4 = as.factor(df$DMDBORN4)
levels(df$DMDBORN4) = c("United States", "Other")

After all of that, our data is ready for analysis and clustering.

Exploratory Data Analysis of the Demographic Variables

Before clustering, I want to observe the distribution of respondents across different demographics.

Gender

ggplot(data = df, aes(RIAGENDR)) + 
  geom_bar(fill="red", color="black") + 
  xlab("Gender") +
  ylab("Count of Respondents") + 
  ggtitle("Distribution of Respondents across Gender")

Age

It is worth noting that, according to the documentation from NHANES, “due to disclosure concern, all responses of participants aged 80 years and older are coded as ‘80.’ In NHANES 2017-March 2020 pre-pandemic sample, the weighted mean age for participants 80 years and older is 85 years.”

ggplot(df, aes(RIDAGEYR)) + 
  geom_histogram(fill="red", color="black") + 
  xlab("Age of Respondents") + 
  ylab("Count of Respondents") + 
  ggtitle("Age Distribution of Respondents")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Race

Non-Hispanic White and Non-Hispanic Black respondents are the two largest groups of respondents.

ggplot(df, aes(RIDRETH3)) + 
  geom_bar(fill="red", color="black", position = "identity") + 
  xlab("Race of Respondents") +
  ylab("Count of Respondents") +
  ggtitle("Racial Distribution of Respondents")

Country of Birth

The vast majority of respondents were born in the United States.

ggplot(df, aes(as.factor(DMDBORN4))) + 
  geom_bar(fill="red", color="black", position = "identity") + 
  xlab("Country of Birth") + 
  ylab("Count of Respondents") +
  ggtitle("Respondents' Country of Birth")

Income-to-Poverty Ratio

This is an interesting variable I wanted to include in this analysis. Per the CDC, “This variable is the ratio of family income to poverty. The Department of Health and Human Services (HHS) poverty guidelines were used as the poverty measure to calculate this ratio… INDFMPIR was calculated by dividing total annual family (or individual) income by the poverty guidelines specific to the survey year… INDFMPIR values at or above 5.00 were coded as 5.00 or more because of disclosure concerns.”

Around 850 respondents have a ratio coded as 5.00, i.e., about 850 respondents' families make 5 or more times the poverty threshold determined by HHS. The distribution also peaks near 1.00, indicating that many respondents' families sit just below or just above the poverty line.
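The top-coded count behind that statement can be checked directly (assuming the top-coded values are stored as exactly 5.00):

sum(df$INDFMPIR == 5)   # number of respondents coded at the 5.00 ceiling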

ggplot(df, aes(INDFMPIR)) + 
  geom_histogram(fill="red", color="black") + 
  xlab("Income-to-Poverty Ratio Distributions") + 
  ylab("Count of Respondents") + 
  ggtitle("Income-to-Poverty Ratio Distributions")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

For the purposes of clustering, we will exclude the demographic variables for the sake of simplicity and cluster only on the ten depression screener items.

Cluster Analysis Using the Given Features

We will start off by clustering on the given features of the data to determine whether or not we need to reduce dimensions.

Selecting k

Within Sum of Squares

We first need to determine the optimal number of clusters, since we don't have a predetermined value for k. The factoextra package can generate a Within Sum of Squares (elbow) plot suggesting good values for k.

set.seed(123)
fviz_nbclust(df[ , 2:11], kmeans, nstart=10, k.max = 10, method = "wss")
## Warning: did not converge in 10 iterations

There is a hard elbow at k = 2, suggesting that 2 is a good number of clusters for this data.
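For reference, the same elbow curve can be computed by hand by fitting kmeans for each candidate k and extracting tot.withinss; a minimal sketch, assuming the same columns and seed as above:

set.seed(123)
wss = sapply(1:10, function(k) kmeans(df[ , 2:11], centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")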

Silhouette Widths

We can also plot average silhouette widths for different numbers of clusters.

fviz_nbclust(df[ , 2:11], kmeans, nstart=10, k.max = 10, method = "silhouette")
## Warning: did not converge in 10 iterations

This suggests that the optimal number of clusters is also 2, so we will use k = 2 to create our cluster object.
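If we want the underlying number rather than the plot, the average silhouette width for a specific k can also be computed directly with the cluster package; a sketch, assuming the package is installed (the full distance matrix is large but manageable at this sample size):

library(cluster)
set.seed(123)
km2 = kmeans(df[ , 2:11], centers = 2, nstart = 10)
sil = silhouette(km2$cluster, dist(df[ , 2:11]))
mean(sil[ , "sil_width"])   # average silhouette width for k = 2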

Clustering, k=2

set.seed(123)
cluster = kmeans(df[ , 2:11], centers = 2, nstart = 10)

Between SS -to- Within SS ratio

After creating the cluster, we can check the ratio of Between Sum of Squares to total Within Sum of Squares. The entire objective of cluster analysis is to minimize intra-cluster distances (Within Sum of Squares) and maximize inter-cluster distances (Between Sum of Squares). This means we want this ratio to be fairly high, since a higher ratio indicates better separation between the clusters.
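For reference, the kmeans object stores all three quantities (betweenss, tot.withinss, and totss), and they satisfy totss = betweenss + tot.withinss; a quick check on the cluster object created above:

all.equal(cluster$totss, cluster$betweenss + cluster$tot.withinss)   # should be TRUE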

cluster$betweenss / cluster$tot.withinss
## [1] 0.3789937

This ratio is fairly low, which would suggest that the two clusters were not separated very well. And if we visualize the clusters…

fviz_cluster(cluster, geom="point", data = df[, 2:11]) + 
  ggtitle("K Means Clustering, Given Features, K=2")

…we can see that there is in fact some overlap between the clusters.

Clustering using given features: Conclusion

So overall, clustering on the given features doesn't produce great results: the average silhouette widths are low for every number of clusters, the BetweenSS/WithinSS ratio is low, and the result is two clusters without clear separation.

Cluster Analysis using PCA

Since using the given features didn't work well, we'll use Principal Component Analysis (PCA) to construct a new set of derived features that capture a majority of the variation. These new variables are weighted linear combinations of the original variables. Because all ten screener items share the same 0-3 response scale, we can run prcomp on the (centered) data without rescaling. I'll name the object "pca."

pca = prcomp(df[ , 2:11])
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.5639 0.8959 0.77851 0.75774 0.69783 0.63445 0.55383
## Proportion of Variance 0.3959 0.1299 0.09811 0.09295 0.07883 0.06516 0.04965
## Cumulative Proportion  0.3959 0.5258 0.62394 0.71689 0.79572 0.86088 0.91053
##                            PC8     PC9    PC10
## Standard deviation     0.50385 0.44923 0.31150
## Proportion of Variance 0.04109 0.03267 0.01571
## Cumulative Proportion  0.95162 0.98429 1.00000

Checking the summary for pca, we see that the first two principal components cumulatively explain a majority of the variation (about 53%).
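A scree plot makes the same point visually; factoextra's fviz_eig draws the percentage of variance explained by each component:

fviz_eig(pca, addlabels = TRUE)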

Let's extract both the rotation matrix (the loadings, i.e., the weights applied to each original variable within each principal component) and the scores table (the coordinates of each observation on the new principal components).

### Create scores table
scores = as.data.frame(pca$x) # the "x" table is the PCA scores
head(scores)
##           PC1        PC2        PC3         PC4        PC5          PC6
## 3  -3.3262148 -0.6265220 -0.2553589 -0.41153137 -0.0577104  0.145180502
## 5   0.4500795 -0.8136352 -0.1747562  0.09570051  0.7172132 -0.411458639
## 8   0.8445086 -1.0644787 -0.8042956  0.20005289 -0.6246241  0.016955165
## 10 -0.9783912 -1.6742050 -0.3720457  1.00718528  0.6351483 -2.922131903
## 11  0.7986192 -1.0014498 -0.3262980 -0.94103994 -0.1689869 -0.920723411
## 16  1.4754906 -0.4891681  0.1368631 -1.00078711 -0.2075593  0.004397236
##            PC7          PC8         PC9        PC10
## 3   0.34408901 -0.186495461 -1.30807094  0.52012058
## 5  -2.59148212  1.023703197 -0.68149480  0.28411402
## 8  -0.01595065  0.428749376  0.84464109  0.01244286
## 10  0.82185371 -0.078925826 -0.84517872  0.66556387
## 11  0.30428902 -0.004515144 -0.18253003  0.18893235
## 16  0.01729559  0.010093016  0.07688302 -0.04369326
### Observe weights from the rotation matrix
rot = as.data.frame(pca$rotation)
rot
##                PC1         PC2         PC3          PC4         PC5
## DPQ010 -0.33888239 -0.27775242 -0.10258661  0.316899939 -0.67032920
## DPQ020 -0.37696747 -0.26925667 -0.24914754  0.096533814 -0.07758497
## DPQ030 -0.41597831  0.84400063 -0.30732244 -0.066746187 -0.11637963
## DPQ040 -0.38376639  0.13524591  0.68856154  0.537047636  0.24929013
## DPQ050 -0.34426532 -0.08131655  0.49654845 -0.754635609 -0.22841029
## DPQ060 -0.29990396 -0.24302510 -0.21401363 -0.036786644  0.11615731
## DPQ070 -0.31173108 -0.17193554 -0.21746242 -0.126381311  0.60053311
## DPQ080 -0.20298084 -0.09059098 -0.08482831 -0.097243146  0.17488567
## DPQ090 -0.09281248 -0.06532737 -0.05799557 -0.007063644  0.03958507
## DPQ100 -0.25939741 -0.10961806 -0.09287618  0.032770642  0.10243909
##                 PC6         PC7         PC8          PC9          PC10
## DPQ010  0.461557192 -0.03373173 -0.15829319 -0.091853438 -0.0074051692
## DPQ020 -0.436889004  0.20520261  0.68149147 -0.068803644  0.0764133047
## DPQ030 -0.001634186  0.02545251 -0.02450508 -0.030451572 -0.0005786968
## DPQ040 -0.070199072  0.01008263 -0.01025077 -0.067741763  0.0050578993
## DPQ050  0.010769052  0.07658743  0.02469282  0.008339065 -0.0012645148
## DPQ060 -0.488231644  0.08179082 -0.69609963 -0.190609406  0.1562122994
## DPQ070  0.583403612  0.29421874  0.03583951 -0.144937801  0.0119942811
## DPQ080  0.034000418 -0.91582517  0.12235410 -0.204497783  0.0816905195
## DPQ090 -0.105634579 -0.04161213 -0.04485256 -0.067649887 -0.9811520403
## DPQ100 -0.001341207 -0.12812969 -0.07984910  0.936754214 -0.0141365357

Selecting k

Using the first two principal components from the scores table as our new features, we can generate a Total Within Sum of Squares plot like we did before and determine the optimal number of clusters.

set.seed(123)
fviz_nbclust(scores[ , 1:2], kmeans, nstart=10, k.max = 10, method = "wss")

The plot shows a sharp elbow at 2 and then a slightly more subtle bend at 4, indicating that the optimal number of clusters is probably within the 2-4 range. Given the nature of the data and the nuance we're trying to extract from this study, we'll try k=4.

set.seed(123)
cluster2 = kmeans(scores[ , 1:2], centers = 4, nstart = 10)

Between SS -to- Within SS ratio

If we observe the BetweenSS/WithinSS ratio …

cluster2$betweenss / cluster2$tot.withinss
## [1] 3.166259

… we see that the ratio using the new features is much higher than it was when we used the given features, indicating that there’s better separation between the clusters.

Visualizing in 2D

Let’s visualize a scatter plot with the two principal components and k=4 clusters. I will add the cluster vector from the cluster object to the scores table so that we can color each observation by its assigned cluster.

scores$cluster = as.character(cluster2$cluster)

ggplot(scores, aes(x=PC1, y=PC2)) + 
  geom_point(aes(color=cluster)) + 
  scale_color_manual(values = c("red", "gold", "dark green", "blue")) + 
  ggtitle("k-Means Clustering, Principal Components, K=4")

Visualizing in 3D

We can also use the third principal component to visualize the clusters in three dimensions (I don't like this specific visualization as much, but plotting in 3D is a fun exercise).

plot_ly(data = scores, x=scores$PC1, y=scores$PC2, z=scores$PC3, color = scores$cluster)
## No trace type specified:
##   Based on info supplied, a 'scatter3d' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
## No scatter3d mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

Clustering using PCA: Conclusion

Using PCA resulted in much better separation between the clusters. This will allow us to evaluate the clusters and see if respondents were properly separated into appropriate groups based on their responses to the survey.

Evaluation and Interpretation

First we’ll create a new data frame called “survey” using the original features and adding the cluster labels as a new column.

survey = cbind(df, cluster = cluster2$cluster)
survey$cluster = as.factor(survey$cluster)

The next step I want to take is adding a column called PHQ9_score. This represents a respondent's score on the depression module of the Patient Health Questionnaire (Kroenke and Spitzer, 2002; Kroenke et al., 2001). This is the nine-item depression screener that was administered to each interviewee, i.e., columns DPQ010-DPQ090 in our dataset.

survey$PHQ9_score = rowSums(survey[ , 2:10])
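An equivalent way to compute the score is to sum by column name rather than by position, which is a bit safer if the column order ever changes; a sketch that should produce the same values:

phq9_items = paste0("DPQ0", seq(10, 90, by = 10))   # DPQ010 through DPQ090
survey$PHQ9_score = rowSums(survey[ , phq9_items])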

From here, we can use the PHQ-9 scores to categorize the severity of respondents' depressive symptoms based on established cut-off points: 0-4 (none), 5-9 (mild), 10-14 (moderate), 15-19 (moderately severe), and 20-27 (severe).

We'll incorporate this logic into a new column, depressive_symptoms_cat.

survey$depressive_symptoms_cat = ""
survey$depressive_symptoms_cat[survey$PHQ9_score >= 0 & survey$PHQ9_score<=4] <- "None"
survey$depressive_symptoms_cat[survey$PHQ9_score >= 5 & survey$PHQ9_score<=9] <- "Mild"
survey$depressive_symptoms_cat[survey$PHQ9_score >= 10 & survey$PHQ9_score<=14] <- "Moderate"
survey$depressive_symptoms_cat[survey$PHQ9_score >= 15 & survey$PHQ9_score<=19] <- "Moderately-Severe"
survey$depressive_symptoms_cat[survey$PHQ9_score >= 20 & survey$PHQ9_score<=27] <- "Severe"
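The same categorization can be written more compactly with cut(); a sketch that assigns to a separate column so the character version above is left untouched (cut() returns a factor rather than character labels):

survey$depressive_symptoms_cat2 = cut(survey$PHQ9_score,
                                      breaks = c(-1, 4, 9, 14, 19, 27),
                                      labels = c("None", "Mild", "Moderate", "Moderately-Severe", "Severe"))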

Visualization

From there we can visualize the distribution of PHQ-9 scores per cluster and see which ranges/depressive symptom categories our clusters have been separated into.

ggplot(survey, aes(cluster, PHQ9_score)) + 
  geom_boxplot(outlier.color = "red") +
  xlab("Clusters") + 
  ylab("PHQ-9 Scores") +
  ggtitle("Distribution of Questionnaire Scores per Cluster")

Based on the boxplots, we can tell that group 2 primarily contains respondents with no depressive symptoms, while group 3 primarily contains respondents with moderate to severe depressive symptoms, with a median at the moderately-severe cut-off point. So for the most part, our ultimate goal of isolating these respondents into a group was accomplished.
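A cross-tabulation of cluster membership against symptom category backs this reading up more precisely than the boxplots alone:

table(survey$cluster, survey$depressive_symptoms_cat)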

Analysis of Group 3 (Moderate-Severe Depressive Symptoms)

If we want to go further, we can study the members of group 3 to see what patterns emerge from the demographic data and survey results. I'll create a data frame called "group3" with all the members of the third cluster.

group3 = survey[ survey$cluster == 3 , ]

Visualizations

Age

ggplot(data = group3, aes(RIDAGEYR)) + 
  geom_histogram(fill="red", color="black") + 
  geom_vline(xintercept = mean(group3$RIDAGEYR), linetype = "dashed", color="green") +
  geom_vline(xintercept = mean(df$RIDAGEYR), linetype = "dashed", color="blue") +
  ggtitle("Age Distribution of Respondents in Cluster 3") +
  xlab("Age") +
  ylab("Count of Respondents")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

If we plot age, and include vertical lines representing the means of the original dataset (blue) and group 3 (green), we can see that the average age of respondents in group 3 is slightly younger than the average age of the entire sample.
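The exact means behind the two reference lines can be printed directly if you want the numbers:

c(group3 = mean(group3$RIDAGEYR), overall = mean(df$RIDAGEYR))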

Income-to-Poverty Ratio

ggplot(data = group3, aes(INDFMPIR)) + 
  geom_histogram(fill="red", color="black") + 
  geom_vline(xintercept = mean(group3$INDFMPIR), linetype = "dashed", color="green", size=1) +
  geom_vline(xintercept = mean(df$INDFMPIR), linetype = "dashed", color="blue", size=1) +
  ggtitle("Distribution of Income-to-Poverty Ratios for Cluster 3") + 
  xlab("Income-to-Poverty Ratios") +
  ylab("Count of Respondents")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In observing the family income-to-poverty ratio for group 3 respondents, we can see that the average ratio (green) is noticeably lower than the average for the entire sample (blue). A large proportion of respondents appear to be at or around the poverty threshold (a ratio of 1). The number of respondents in the 5th bin, which represents a ratio below 1 (i.e., family income below the poverty line), is nearly equal to that of the last bin, which represents respondents making 5+ times the poverty line. This is consistent with studies that have linked depression and poverty.

Depressive Symptoms

We can visualize the distribution of respondents based on their scores…

ggplot(group3, aes(PHQ9_score)) + 
  geom_histogram(fill="red", color="black", binwidth = 1) + 
  geom_vline(xintercept = mean(group3$PHQ9_score), linetype = "dashed", color="green", size=1) + 
  ggtitle("Distribution of Questionnaire Scores for Cluster 3") +
  xlab("Questionnaire Scores") +
  ylab("Count of Respondents")

…and their respective depressive symptom categories.

ggplot(group3, aes(depressive_symptoms_cat)) + 
  geom_bar(fill="red", color="black", position = "identity") +
  ggtitle("Categorization of Depressive Symptoms for Cluster 3") + 
  xlab("Depressive Symptom Categories") + 
  ylab("Count of Respondents")

We can see that the mean score of the group is greater than 15 (and the median is exactly 15), suggesting that the average person in this group displays moderately severe depressive symptoms. That said, the bar chart shows that the largest proportion of respondents actually falls in the moderate category.

Conclusion

In conclusion, we practiced data preprocessing, exploratory analysis and visualization, and cluster analysis using both the given features and features extracted with PCA. We successfully (more or less) grouped respondents with moderate-to-severe depressive symptoms without having to label the observations, and through further analysis of this group we observed that these respondents were, on average, younger and closer to the poverty line than the overall sample.