The purpose of this exercise is to practice clustering survey data using the k-means clustering algorithm. I will be using data from the National Health and Nutrition Examination Survey (NHANES), a major program of the National Center for Health Statistics (NCHS), which is part of the Centers for Disease Control and Prevention (CDC). Ultimately I want to use k-means to segment survey respondents into different groups based on their responses to NHANES’ depression screening questionnaire and see if I can isolate the respondents with moderate-to-severe depressive symptoms. By the end I hope to have successfully practiced applying k-means, tuning hyperparameters, and evaluating/interpreting the results.
The foundational data used in this exercise comes from the NHANES program's 2017-March 2020 pre-pandemic data. After doing a left join of the depression screener data (P_DPQ) with the demographic data (P_DEMO), the dataset contained the following variables:
SEQN - Respondent sequence number
RIAGENDR - Gender
RIDAGEYR - Age in years at screening
RIDRETH3 - Race/Hispanic origin
DMDBORN4 - Country of birth
INDFMPIR - Ratio of family income to poverty
DPQ010 - Have little interest in doing things
DPQ020 - Feeling down, depressed, or hopeless
DPQ030 - Trouble sleeping or sleeping too much
DPQ040 - Feeling tired or having little energy
DPQ050 - Poor appetite or overeating
DPQ060 - Feeling bad about yourself
DPQ070 - Trouble concentrating on things
DPQ080 - Moving or speaking slowly or too fast
DPQ090 - Thoughts you would be better off dead
DPQ100 - Difficulty these problems have caused
Links:
https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Demographics&Cycle=2017-2020
https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Questionnaire&Cycle=2017-2020
library(haven)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
dep = read_xpt("/Users/bryanokani/Downloads/P_DPQ.xpt")
demo = read_xpt("/Users/bryanokani/Downloads/P_DEMO.xpt")
# Keep only SEQN, RIAGENDR, RIDAGEYR, RIDRETH3, DMDBORN4, and INDFMPIR from the demographic file
df = merge(dep, demo[ , c(1,4,5,8,10,29)], by = "SEQN", all.x = TRUE)
head(df)
## SEQN DPQ010 DPQ020 DPQ030 DPQ040 DPQ050 DPQ060 DPQ070 DPQ080 DPQ090 DPQ100
## 1 109266 0 0 0 0 0 0 0 0 0 NA
## 2 109271 2 1 0 0 0 0 2 0 0 0
## 3 109273 2 2 2 2 2 2 2 1 0 0
## 4 109274 0 0 0 0 0 0 0 0 0 NA
## 5 109282 0 1 0 1 0 0 0 3 0 0
## 6 109284 0 0 0 0 0 0 0 0 0 NA
## RIAGENDR RIDAGEYR RIDRETH3 DMDBORN4 INDFMPIR
## 1 2 29 6 2 5.00
## 2 1 49 3 1 NA
## 3 1 36 3 1 0.83
## 4 1 68 7 1 1.20
## 5 1 76 3 1 3.61
## 6 2 44 1 2 NA
We currently have 8965 rows and 16 columns. Since we didn't import the entire demographic data frame, we don't need to remove any more variables.
First, let’s remove all the observations that have missing data (NAs). This brings the total number of observations down to 4833.
df = na.omit(df)
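As a quick sanity check on that claim, we can recount the rows (output omitted here):
nrow(df) # should now be 4833 after dropping rows with missing values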
Reading through the documentation from NHANES, we can see that for each of the survey questions in the depression screener, a certain number of interviewees either responded "Don't Know" or refused to respond (coded as 7 or 9 for the screener items, and 77 or 99 for country of birth). I decided to remove these observations from the data as well.
df = df[ !(df$DPQ010 == 7 | df$DPQ010 == 9) , ]
df = df[ !(df$DPQ020 == 7 | df$DPQ020 == 9) , ]
df = df[ !(df$DPQ030 == 7 | df$DPQ030 == 9) , ]
df = df[ !(df$DPQ040 == 7 | df$DPQ040 == 9) , ]
df = df[ !(df$DPQ050 == 7 | df$DPQ050 == 9) , ]
df = df[ !(df$DPQ060 == 7 | df$DPQ060 == 9) , ]
df = df[ !(df$DPQ070 == 7 | df$DPQ070 == 9) , ]
df = df[ !(df$DPQ080 == 7 | df$DPQ080 == 9) , ]
df = df[ !(df$DPQ090 == 7 | df$DPQ090 == 9) , ]
df = df[ !(df$DPQ100 == 7 | df$DPQ100 == 9) , ]
df = df[ !(df$DMDBORN4 == 77 | df$DMDBORN4 == 99) , ]
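As a side note, an equivalent and more compact way to drop these codes (a sketch, assuming the DPQ items sit in columns 2 through 11 of df as in the merged data above) would be:
# Keep rows where no screener item was coded 7 or 9 ("Refused"/"Don't Know")
dpq_ok = rowSums(df[ , 2:11] == 7 | df[ , 2:11] == 9) == 0
# ...and where country of birth was not coded 77 or 99
demo_ok = !(df$DMDBORN4 %in% c(77, 99))
df = df[dpq_ok & demo_ok, ]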
The last step involves cleaning the categorical demographic columns. I want to convert these variables to factors and rename the levels, as well as combine the Race/Hispanic Origin categories "Mexican American" and "Other Hispanic" into a single "Hispanic" category for simplicity's sake.
#Gender
df$RIAGENDR = as.factor(df$RIAGENDR)
levels(df$RIAGENDR) = c("Male", "Female")
#Race
df$RIDRETH3[which(df$RIDRETH3 == 1)] <- 2
df$RIDRETH3 = as.factor(df$RIDRETH3)
levels(df$RIDRETH3) = c("Hispanic", "White", "Black", "Asian", "Other")
#Country of Birth
df$DMDBORN4 = as.factor(df$DMDBORN4)
levels(df$DMDBORN4) = c("United States", "Other")
After all of that, our data is ready for analysis and clustering.
Before clustering, I want to observe the distribution of respondents across different demographics.
ggplot(data = df, aes(RIAGENDR)) +
geom_bar(fill="red", color="black") +
xlab("Gender") +
ylab("Count of Respondents") +
ggtitle("Distribution of Respondents across Gender")
It is worth noting that, according to the documentation from NHANES, “due to disclosure concern, all responses of participants aged 80 years and older are coded as ‘80.’ In NHANES 2017-March 2020 pre-pandemic sample, the weighted mean age for participants 80 years and older is 85 years.”
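Since 80 therefore acts as a catch-all value, it may be worth checking how many respondents fall into that top-coded group (count omitted here):
sum(df$RIDAGEYR == 80) # respondents aged 80 and older, all coded as 80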
ggplot(df, aes(RIDAGEYR)) +
geom_histogram(fill="red", color="black") +
xlab("Age of Respondents") +
ylab("Count of Respondents") +
ggtitle("Age Distribution of Respondents")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Non-Hispanic White and Non-Hispanic Black respondents are the two largest groups of respondents.
ggplot(df, aes(RIDRETH3)) +
geom_bar(fill="red", color="black", position = "identity") +
xlab("Race of Respondents") +
ylab("Count of Respondents") +
ggtitle("Racial Distribution of Respondents")
The vast majority of respondents were born in the United States.
ggplot(df, aes(as.factor(DMDBORN4))) +
geom_bar(fill="red", color="black", position = "identity") +
xlab("Country of Birth") +
ylab("Count of Respondents") +
ggtitle("Respondents' Country of Birth")
This is an interesting variable I wanted to include in this analysis. Per the CDC, “This variable is the ratio of family income to poverty. The Department of Health and Human Services (HHS) poverty guidelines were used as the poverty measure to calculate this ratio… INDFMPIR was calculated by dividing total annual family (or individual) income by the poverty guidelines specific to the survey year… INDFMPIR values at or above 5.00 were coded as 5.00 or more because of disclosure concerns.”
Around 850 respondents have a ratio value coded as 5.00, i.e., about 850 respondents have family incomes at or above 5 times the poverty threshold determined by HHS. There also appears to be a concentration of respondents around a ratio of 1.00, which indicates that many respondents (or their families) sit just below or slightly above the poverty line.
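The count of respondents coded at the 5.00 cap can be checked directly (a quick check; the figure of roughly 850 comes from the histogram below):
sum(df$INDFMPIR == 5) # respondents whose income-to-poverty ratio was top-coded at 5.00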
ggplot(df, aes(INDFMPIR)) +
geom_histogram(fill="red", color="black") +
xlab("Income-to-Poverty Ratio Distributions") +
ylab("Count of Respondents") +
ggtitle("Income-to-Poverty Ratio Distributions")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
For the purposes of clustering, we are going to set aside the demographic variables and cluster only on the ten screener items, for the sake of simplicity.
We will start off by clustering on the given features of the data to determine whether or not we need to reduce dimensions.
We first need to determine the optimal number of clusters, since we don't have a predetermined value for k. The factoextra package can generate a total within-cluster sum of squares (WSS) plot to suggest good values for k.
set.seed(123)
fviz_nbclust(df[ , 2:11], kmeans, nstart=10, k.max = 10, method = "wss")
## Warning: did not converge in 10 iterations
There is a hard elbow at k = 2, suggesting that 2 is a good number of clusters for this data.
We can also plot average silhouette widths for different numbers of clusters.
fviz_nbclust(df[ , 2:11], kmeans, nstart=10, k.max = 10, method = "silhouette")
## Warning: did not converge in 10 iterations
This suggests that the optimal number of clusters is also 2, so we will use k = 2 to create our cluster object.
set.seed(123)
cluster = kmeans(df[ , 2:11], centers = 2, nstart = 10)
After creating the cluster object, we can check the ratio of the between-cluster sum of squares to the total within-cluster sum of squares. The objective of cluster analysis is to minimize intra-cluster distances (within sum of squares) and maximize inter-cluster distances (between sum of squares), so we want this ratio to be fairly high; a higher ratio indicates that k-means has separated the clusters well.
cluster$betweenss / cluster$tot.withinss
## [1] 0.3789937
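For reference, the kmeans object also stores the total sum of squares, so a related diagnostic is the share of total variance explained by the clustering (an optional extra check, not part of the workflow above). Either way, the diagnostic we use in this write-up is the between/within ratio printed above.
cluster$betweenss / cluster$totss # proportion of the total sum of squares that is between clusters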
That ratio is fairly low (about 0.38), which suggests that the two clusters were not separated very well. And if we visualize the clusters…
fviz_cluster(cluster, geom="point", data = df[, 2:11]) +
ggtitle("K Means Clustering, Given Features, K=2")
…we can see that there is in fact some overlap between the clusters.
So overall, clustering on the given features doesn't produce great results. The average silhouette widths are low for every number of clusters, the between/within ratio is low, and the result is two clusters without clear separation.
Since clustering on the given features didn't work well, we'll use Principal Component Analysis (PCA) to construct a new set of derived features that capture most of the variation. These new variables are essentially weighted combinations (linear combinations) of the original variables. I'll name the object "pca."
pca = prcomp(df[ , 2:11])
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.5639 0.8959 0.77851 0.75774 0.69783 0.63445 0.55383
## Proportion of Variance 0.3959 0.1299 0.09811 0.09295 0.07883 0.06516 0.04965
## Cumulative Proportion 0.3959 0.5258 0.62394 0.71689 0.79572 0.86088 0.91053
## PC8 PC9 PC10
## Standard deviation 0.50385 0.44923 0.31150
## Proportion of Variance 0.04109 0.03267 0.01571
## Cumulative Proportion 0.95162 0.98429 1.00000
Checking the summary for pca, we see that the first two principal components cumulatively explain just over half of the variation (about 53%).
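We can also visualize the variance explained by each component with a scree plot from factoextra (an optional visual aid):
fviz_eig(pca, addlabels = TRUE) # percentage of variance explained by each principal component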
Let’s extract both the rotation matrix (basically, the weights associated with each original variable within each new principal component) and the scores table (the newly constructed values for each principal component).
### Create scores table
scores = as.data.frame(pca$x) # the "x" table is the PCA scores
head(scores)
## PC1 PC2 PC3 PC4 PC5 PC6
## 3 -3.3262148 -0.6265220 -0.2553589 -0.41153137 -0.0577104 0.145180502
## 5 0.4500795 -0.8136352 -0.1747562 0.09570051 0.7172132 -0.411458639
## 8 0.8445086 -1.0644787 -0.8042956 0.20005289 -0.6246241 0.016955165
## 10 -0.9783912 -1.6742050 -0.3720457 1.00718528 0.6351483 -2.922131903
## 11 0.7986192 -1.0014498 -0.3262980 -0.94103994 -0.1689869 -0.920723411
## 16 1.4754906 -0.4891681 0.1368631 -1.00078711 -0.2075593 0.004397236
## PC7 PC8 PC9 PC10
## 3 0.34408901 -0.186495461 -1.30807094 0.52012058
## 5 -2.59148212 1.023703197 -0.68149480 0.28411402
## 8 -0.01595065 0.428749376 0.84464109 0.01244286
## 10 0.82185371 -0.078925826 -0.84517872 0.66556387
## 11 0.30428902 -0.004515144 -0.18253003 0.18893235
## 16 0.01729559 0.010093016 0.07688302 -0.04369326
### Observe weights from the rotation matrix
rot = as.data.frame(pca$rotation)
rot
## PC1 PC2 PC3 PC4 PC5
## DPQ010 -0.33888239 -0.27775242 -0.10258661 0.316899939 -0.67032920
## DPQ020 -0.37696747 -0.26925667 -0.24914754 0.096533814 -0.07758497
## DPQ030 -0.41597831 0.84400063 -0.30732244 -0.066746187 -0.11637963
## DPQ040 -0.38376639 0.13524591 0.68856154 0.537047636 0.24929013
## DPQ050 -0.34426532 -0.08131655 0.49654845 -0.754635609 -0.22841029
## DPQ060 -0.29990396 -0.24302510 -0.21401363 -0.036786644 0.11615731
## DPQ070 -0.31173108 -0.17193554 -0.21746242 -0.126381311 0.60053311
## DPQ080 -0.20298084 -0.09059098 -0.08482831 -0.097243146 0.17488567
## DPQ090 -0.09281248 -0.06532737 -0.05799557 -0.007063644 0.03958507
## DPQ100 -0.25939741 -0.10961806 -0.09287618 0.032770642 0.10243909
## PC6 PC7 PC8 PC9 PC10
## DPQ010 0.461557192 -0.03373173 -0.15829319 -0.091853438 -0.0074051692
## DPQ020 -0.436889004 0.20520261 0.68149147 -0.068803644 0.0764133047
## DPQ030 -0.001634186 0.02545251 -0.02450508 -0.030451572 -0.0005786968
## DPQ040 -0.070199072 0.01008263 -0.01025077 -0.067741763 0.0050578993
## DPQ050 0.010769052 0.07658743 0.02469282 0.008339065 -0.0012645148
## DPQ060 -0.488231644 0.08179082 -0.69609963 -0.190609406 0.1562122994
## DPQ070 0.583403612 0.29421874 0.03583951 -0.144937801 0.0119942811
## DPQ080 0.034000418 -0.91582517 0.12235410 -0.204497783 0.0816905195
## DPQ090 -0.105634579 -0.04161213 -0.04485256 -0.067649887 -0.9811520403
## DPQ100 -0.001341207 -0.12812969 -0.07984910 0.936754214 -0.0141365357
Using the scores table as our new set of features, we can generate a plot of Total Within Sum of Squares like we did before and determine the optimal number of clusters.
set.seed(123)
fviz_nbclust(scores[ , 1:2], kmeans, nstart=10, k.max = 10, method = "wss")
The plot shows a sharp elbow at 2 and then a more subtle bend at 4, indicating that the optimal number of clusters is probably in the 2-4 range. Given the nature of the data and the nuance we're trying to extract from this study, we'll try k = 4.
set.seed(123)
cluster2 = kmeans(scores[ , 1:2], centers = 4, nstart = 10)
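Before evaluating the fit, we can glance at how many respondents landed in each of the four clusters (sizes omitted here):
cluster2$size # number of observations assigned to each cluster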
If we observe the BetweenSS/WithinSS ratio …
cluster2$betweenss / cluster2$tot.withinss
## [1] 3.166259
… we see that the ratio using the new features is much higher than it was when we used the given features, indicating that there’s better separation between the clusters.
Let’s visualize a scatter plot with the two principal components and k=4 clusters. I will add the cluster vector from the cluster object to the scores table so that we can color each observation by its assigned cluster.
scores$cluster = as.character(cluster2$cluster)
ggplot(scores, aes(x=PC1, y=PC2)) +
geom_point(aes(color=cluster)) +
scale_color_manual(values = c("red", "gold", "dark green", "blue")) +
ggtitle("k-Means Clustering, Principal Components, K=4")
We can also use the third principal component to visualize the clusters in 3 dimensions (I don't like this specific visualization as much, but plotting in 3D is a fun exercise).
plot_ly(data = scores, x=scores$PC1, y=scores$PC2, z=scores$PC3, color = scores$cluster)
## No trace type specified:
## Based on info supplied, a 'scatter3d' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
## No scatter3d mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
Using PCA resulted in much better separation between the clusters. This will allow us to evaluate the clusters and see if respondents were properly separated into appropriate groups based on their responses to the survey.
First we’ll create a new data frame called “survey” using the original features and adding the cluster labels as a new column.
survey = cbind(df, cluster = cluster2$cluster)
survey$cluster = as.factor(survey$cluster)
The next step I want to take is adding a column called PHQ9_score. This represents a respondent's score on the depression module of the Patient Health Questionnaire (Kroenke and Spitzer, 2002; Kroenke et al., 2001), the nine-item depression screener that was administered to each interviewee, i.e., columns DPQ010-DPQ090 in our dataset.
survey$PHQ9_score = rowSums(survey[ , 2:10])
From here, we can use the PHQ-9 scores to categorize the severity of each respondent's depressive symptoms based on established cut-off points:
0-4: no depressive symptoms
5-9: mild depressive symptoms
10-14: moderate depressive symptoms
15-19: moderately-severe depressive symptoms
20-27: severe depressive symptoms
We'll incorporate this logic into a new column, depressive_symptoms_cat.
survey$depressive_symptoms_cat = ""
survey$depressive_symptoms_cat[survey$PHQ9_score >= 0 & survey$PHQ9_score<=4] <- "None"
survey$depressive_symptoms_cat[survey$PHQ9_score >= 5 & survey$PHQ9_score<=9] <- "Mild"
survey$depressive_symptoms_cat[survey$PHQ9_score >= 10 & survey$PHQ9_score<=14] <- "Moderate"
survey$depressive_symptoms_cat[survey$PHQ9_score >= 15 & survey$PHQ9_score<=19] <- "Moderately-Severe"
survey$depressive_symptoms_cat[survey$PHQ9_score >= 20 & survey$PHQ9_score<=27] <- "Severe"
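A more compact, equivalent way to derive the same categories would be cut() with the same break points (a sketch; note that it produces a factor rather than a character column):
survey$depressive_symptoms_cat = cut(survey$PHQ9_score,
                                     breaks = c(-1, 4, 9, 14, 19, 27),
                                     labels = c("None", "Mild", "Moderate",
                                                "Moderately-Severe", "Severe"))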
From there we can visualize the distribution of PHQ-9 scores per cluster and see which ranges/depressive symptom categories our clusters have been separated into.
ggplot(survey, aes(cluster, PHQ9_score)) +
geom_boxplot(outlier.color = "red") +
xlab("Clusters") +
ylab("PHQ-9 Scores") +
ggtitle("Distribution of Questionnaire Scores per Cluster")
Based on the boxplots, we can tell that cluster 2 primarily contains respondents with no depressive symptoms, while cluster 3 primarily contains respondents with moderate to severe depressive symptoms, with a median at the moderately-severe cut-off point. So, for the most part, our ultimate goal of isolating these respondents into a single group was accomplished.
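To put numbers to that impression, we could cross-tabulate the cluster assignments against the symptom categories (counts omitted here):
table(survey$cluster, survey$depressive_symptoms_cat) # respondents per cluster and symptom category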
If we want to go further, we can examine the members of cluster 3 to see what patterns emerge from the demographic data and survey results. I'll create a data frame called "group3" with all the members of the third cluster.
group3 = survey[ survey$cluster == 3 , ]
ggplot(data = group3, aes(RIDAGEYR)) +
geom_histogram(fill="red", color="black") +
geom_vline(xintercept = mean(group3$RIDAGEYR), linetype = "dashed", color="green") +
geom_vline(xintercept = mean(df$RIDAGEYR), linetype = "dashed", color="blue") +
ggtitle("Age Distribution of Respondents in Cluster 3") +
xlab("Age") +
ylab("Count of Respondents")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
If we plot age and include vertical lines representing the mean ages of the original dataset (blue) and group 3 (green), we can see that the average age of respondents in group 3 is slightly lower than the average age of the entire sample.
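The exact means behind the two reference lines can be pulled out directly (values omitted):
mean(group3$RIDAGEYR) # average age within cluster 3
mean(df$RIDAGEYR)     # average age across the full cleaned sample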
ggplot(data = group3, aes(INDFMPIR)) +
geom_histogram(fill="red", color="black") +
geom_vline(xintercept = mean(group3$INDFMPIR), linetype = "dashed", color="green", size=1) +
geom_vline(xintercept = mean(df$INDFMPIR), linetype = "dashed", color="blue", size=1) +
ggtitle("Distribution of Income-to-Poverty Ratios for Cluster 3") +
xlab("Income-to-Poverty Ratios") +
ylab("Count of Respondents")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In observing the family income-to-poverty ratio for group 3 respondents, we can see that the group's average ratio (green) is noticeably lower than the average for the entire sample (blue). A large proportion of respondents appear to be at or around the poverty threshold (a ratio of 1). The number of respondents in the 5th bin, representing ratios below 1 (i.e., family income below the poverty line), is nearly equal to that of the last bin, representing respondents who make 5+ times the poverty line. This echoes studies that have linked depression and poverty.
We can visualize the distribution of respondents based on their scores…
ggplot(group3, aes(PHQ9_score)) +
geom_histogram(fill="red", color="black", binwidth = 1) +
geom_vline(xintercept = mean(group3$PHQ9_score), linetype = "dashed", color="green", size=1) +
ggtitle("Distribution of Questionnaire Scores for Cluster 3") +
xlab("Questionnaire Scores") +
ylab("Count of Respondents")
…and their respective depressive symptom categories.
ggplot(group3, aes(depressive_symptoms_cat)) +
geom_bar(fill="red", color="black", position = "identity") +
ggtitle("Categorization of Depressive Symptoms for Cluster 3") +
xlab("Depressive Symptom Categories") +
ylab("Count of Respondents")
We can see that the mean score of the group is greater than 15 (and the median is exactly 15), suggesting that the average respondent in this group displays moderately-severe depressive symptoms. That said, the bar chart shows that the largest single category of respondents actually has moderate depressive symptoms.
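The summary statistics behind those statements can be checked directly (output omitted):
summary(group3$PHQ9_score) # min, quartiles, median, and mean of cluster 3 scores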
In conclusion, we practiced data processing, analysis, visualization, and cluster analysis using both the given features and features extracted with PCA. We (more or less) successfully grouped respondents with moderate-to-severe depressive symptoms without having to label the observations, and through further analysis of that group we observed that these respondents were, on average, younger and closer to the poverty line than the sample as a whole.