Sample size is 9,470

Clusters

Age    Ethnicity         Education        Gender  Children  Income    Marital Status  Cluster Size
25-34  White/Other       High School      Male    No        100-125k  Married         1393
55-64  White/Other       College Grad     Male    No        75-100k   Married         1211
25-34  Hispanic          High School      Male    No        25-35k    Married         874
65+    White/Other       Graduate Degree  Male    No        50-75k    Single          943
25-34  White/Other       High School      Female  Yes       125-150k  Single          1775
35-44  African American  College Grad     Male    No        50-75k    Single          628
25-34  Hispanic          College Grad     Female  Yes       25-35k    Single          557
35-44  Hispanic          College Grad     Female  No        125-150k  Married         754
65+    Hispanic          High School      Male    Yes       75-100k   Single          627
25-34  White/Other       High School      Female  No        50-75k    Single          708

Plots

The 8 plots below show the distribution of each variable of interest.

This plot shows only 2 dimensions of the data. The clustering itself uses all 8 dimensions.

Explanation

With 8 variables to consider, each taking between 2 and 11 possible values, the total number of possible combinations is 16,896.

A common rule of thumb for picking the number of clusters is sqrt(n/2), which for our sample of n = 9,470 gives ~69. The scree plot below, however, points to 10 clusters, so that is what we use.
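The rule-of-thumb figure can be reproduced with a short calculation. This is only a sketch of the arithmetic, not code from the original analysis. The function name rule_of_thumb_k is made up for illustration.

```python
import math


def rule_of_thumb_k(n: int) -> float:
    """Return sqrt(n / 2), a rough heuristic for the number of clusters."""
    return math.sqrt(n / 2)


n = 9470  # sample size reported at the top of this write-up
k_estimate = rule_of_thumb_k(n)
print(round(k_estimate))  # 69
```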

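Since we’re dealing with nominal (categorical) data, we cannot use kmeans, which relies on Euclidean distance and cluster means, neither of which is meaningful for unordered categories. I am using the kmodes clustering algorithm instead. It represents each cluster by its mode (the most frequent value of each variable) rather than its mean, and it measures dissimilarity by counting the variables on which two records differ. The modes are updated using the frequency of each value within a cluster, so respondents who share many of the same answers end up in the same cluster.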
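To make the two ingredients of kmodes concrete, here is a minimal sketch of the matching dissimilarity and the per-variable mode. It is not the implementation used for the results above, and it leaves out the iterative assignment loop. The function names and the toy records are hypothetical, chosen only for illustration.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the variables on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(records):
    """Return the most frequent value of each variable across the records."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


# Hypothetical toy records (age, gender, marital status); not survey data.
records = [
    ("25-34", "Male", "Married"),
    ("25-34", "Female", "Married"),
    ("35-44", "Male", "Married"),
]

mode = cluster_mode(records)
distances = [matching_dissimilarity(r, mode) for r in records]
print(mode)       # ('25-34', 'Male', 'Married')
print(distances)  # [0, 1, 1]
```

A record identical to the mode has dissimilarity 0, and each differing variable adds 1, so no numeric encoding of the categories is needed.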