Clustering the United States of America

April 14, 2020

Problem Statement

There are 50 states in the United States of America. Each one is made up of unique cities and a different mix of people, so the make-up of each state is different. Even so, certain groups of states are more similar to each other than to the rest.

If states were grouped based on similarity, then two questions should be answered:

How many groups of states?

What is the make-up of each group?

Methodology

This study applies unsupervised learning techniques to state data, since no target variable is defined. The technique chosen for this problem is a clustering algorithm called K-Means.

K-Means is preferred here over hierarchical clustering. Because K-Means needs the number of groups up front and that number is unknown, several candidate values are tested before the final model is fit.

Data Sources

Data will be taken from the Census Bureau's Developer's API, which can be accessed at https://www.census.gov/data/developers/data-sets/acs-5year.html. Specifically, this data will come from the American Community Survey (ACS), which is updated annually.

Once on the Developer's API, a query can be built by changing, adding, or removing parameters on a base URL. One example is found at https://api.census.gov/data/2018/acs/acs5?get=B00001_001E&for=state:01.
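The example URL can be assembled from its parts. The following is a minimal sketch in base R; the helper name `build_acs_url` is hypothetical and not part of the original study, and no request is actually sent. The variable code and state code are the ones from the example URL.

```r
# Hypothetical helper (an assumption, not the study's code): assemble an
# ACS 5-year query URL from its parts. No request is sent here.
build_acs_url <- function(year, variables, state_fips) {
  base_url <- paste0("https://api.census.gov/data/", year, "/acs/acs5")
  query <- paste0("?get=", paste(variables, collapse = ","),
                  "&for=state:", state_fips)
  paste0(base_url, query)
}

# Rebuild the example query: one variable, state code 01
example_url <- build_acs_url(2018, "B00001_001E", "01")
print(example_url)
```

Passing several variable codes in `variables` joins them with commas in the `get` parameter, which is how one query could request multiple columns.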

Data will be from the 2018 American Community Survey as it is the most recent as of this study.

The data looks like this…

##        State Total.Population Median.Household.Income Avg.Home.Value Bachelors.Degree
## 1    Alabama           387000                   48486         137200          3299958
## 2     Alaska            96500                   76715         265200           477727
## 3    Arizona           498000                   56213         209600          4633932
## 4   Arkansas           238000                   45726         123300          1999307
## 5 California          2815000                   71228         475900         26218885
## 6   Colorado           431000                   68811         313600          3748592

Choosing K-Means Parameters

K-Means needs the number of groups to be specified in advance. That number can be chosen algorithmically by maximizing the variability explained while keeping the number of groups as small as possible.

These two custom functions will help in this process:

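# Fit a K-Means model with the given number of centroids (groups)
runkmCluster <- function(data, centroids){
  kmCluster <- kmeans(data, centroids)
  return(kmCluster)
}

# Share of total variability explained by the grouping (between SS / total SS)
getClusterPerformance <- function(clusterModel){
  clusterPerformance <- clusterModel$betweenss / clusterModel$totss
  return(clusterPerformance)
}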
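The performance ratio works because the total sum of squares splits into a between-group part and a within-group part. The sketch below checks that decomposition on a small set of made-up points (illustrative only, not the state data).

```r
# Toy points (made up for illustration): two well-separated groups
toy <- matrix(c(1, 1,
                1, 2,
                2, 1,
                9, 9,
                9, 10,
                10, 9), ncol = 2, byrow = TRUE)

set.seed(1)
toy_fit <- kmeans(toy, centers = 2, nstart = 10)

# Total variability = between-group variability + within-group variability
decomposition_holds <- isTRUE(all.equal(
  toy_fit$totss, toy_fit$betweenss + toy_fit$tot.withinss))

# The same ratio that getClusterPerformance returns
share_explained <- toy_fit$betweenss / toy_fit$totss
print(decomposition_holds)
print(share_explained)
```

A ratio near 1 means most of the spread in the data lies between the groups rather than inside them.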

Testing performance with different parameters

km_variability_explained <- numeric(5)

set.seed(43)
# Fit K-Means for 1 to 5 groups and record the share of variability explained
for (i in 1:5){
  cluster_model <- runkmCluster(data[,2:5], i)
  km_variability_explained[i] <- getClusterPerformance(cluster_model)
}

Three groups look like the right choice

Another test to confirm parameter choice

It appears that two groups would account for most of the variability, but choosing a third will get us slightly better performance (likely because of a few outliers).
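One way to make this comparison concrete is to look at the gain in variability explained from each additional group. The sketch below uses R's built-in `USArrests` data as a stand-in (an assumption; it is not the ACS data used in this study), so the numbers it prints do not reproduce the results here.

```r
# Stand-in data (assumption): built-in USArrests, scaled; not the ACS table
stand_in <- scale(USArrests)

set.seed(43)
# Share of variability explained for 1 to 5 groups
explained <- sapply(1:5, function(k) {
  fit <- kmeans(stand_in, centers = k, nstart = 25)
  fit$betweenss / fit$totss
})

# Gain from adding one more group; a sharp drop suggests the "elbow"
marginal_gain <- diff(explained)

plot(1:5, explained, type = "b",
     xlab = "Number of groups", ylab = "Share of variability explained")
print(round(marginal_gain, 3))
```

The number of groups just before the gains flatten out is a reasonable candidate, though the final choice remains a judgment call.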

Visual of Principal Components

Principal components (linear combinations of multiple attributes) were created to make the state groupings easier to see. These components are visualized here:

## K-means clustering with 3 clusters of sizes 29, 18, 4
## 
## Cluster means:
##   Total.Population Median.Household.Income Avg.Home.Value Bachelors.Degree
## 1         202896.6                58961.83       216424.1          1562462
## 2         710222.2                62909.61       217288.9          5600408
## 3        1855000.0                62347.00       284150.0         18081828
## 
## Clustering vector:
##  [1] 1 1 2 1 3 2 1 1 1 3 2 1 1 2 2 1 1 1 1 1 2 2 2 2 1 2 1 1 1 1 2 1 3 2 1
## [36] 2 1 1 2 1 1 1 2 3 1 1 2 1 2 1 2
## 
## Within cluster sum of squares by cluster:
## [1] 2.687410e+13 5.059003e+13 9.924929e+13
##  (between_SS / total_SS =  85.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
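The original code for the principal components is not shown. A minimal sketch of how such a view could be produced with base R's `prcomp` follows, again using the built-in `USArrests` data as a stand-in (an assumption; the study itself uses the ACS table).

```r
# Stand-in data (assumption): built-in USArrests instead of the ACS table
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

set.seed(43)
groups <- kmeans(scale(USArrests), centers = 3, nstart = 25)$cluster

# Each principal component is a weighted combination of the original columns
print(round(pca$rotation[, 1:2], 2))

# Plot the first two components, colored by cluster
plot(pca$x[, 1], pca$x[, 2], col = groups, pch = 19,
     xlab = "PC1", ylab = "PC2")
text(pca$x[, 1], pca$x[, 2], labels = rownames(USArrests),
     pos = 3, cex = 0.6)
```

Scaling before the decomposition keeps a single large-valued column from dominating the components; whether the original analysis scaled its inputs is not stated.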

Run the K-Means Model

# Run the model with parameter of 3 groups
# Fit once and reuse the result: K-Means starts from random centroids, so two
# separate runs can number the same clusters differently
final_model <- runkmCluster(data[,2:5], 3)
final_model

# Add results to original data
clusters <- data.frame(state = data$State, Cluster = final_model$cluster)
data <- merge(data, clusters, by.x = "State", by.y = "state")
colnames(data) <- c("State", "Total Population", 
                    "Median Household Income", 
                    "Average Home Value", 
                    "Bachelors Degrees", 
                    "Cluster")

## K-means clustering with 3 clusters of sizes 18, 29, 4
## 
## Cluster means:
##   Total.Population Median.Household.Income Avg.Home.Value Bachelors.Degree
## 1         710222.2                62909.61       217288.9          5600408
## 2         202896.6                58961.83       216424.1          1562462
## 3        1855000.0                62347.00       284150.0         18081828
## 
## Clustering vector:
##  [1] 2 2 1 2 3 1 2 2 2 3 1 2 2 1 1 2 2 2 2 2 1 1 1 1 2 1 2 2 2 2 1 2 3 1 2
## [36] 1 2 2 1 2 2 2 1 3 2 2 1 2 1 2 1
## 
## Within cluster sum of squares by cluster:
## [1] 5.059003e+13 2.687410e+13 9.924929e+13
##  (between_SS / total_SS =  85.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

What is the make-up of each cluster?

Total Population

Median Household Income

Average Home Value

Residents with a Bachelor's Degree
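A per-cluster comparison of this kind can be sketched with base R. The example below uses only the six rows shown in the data preview, paired with the first six entries of the final clustering vector (assuming the rows are in the same order). It illustrates the mechanics and does not reproduce the full-data results.

```r
# First six rows as printed in the data preview; cluster labels are the first
# six entries of the clustering vector (assumes the same row order)
preview <- data.frame(
  State = c("Alabama", "Alaska", "Arizona", "Arkansas",
            "California", "Colorado"),
  Total.Population = c(387000, 96500, 498000, 238000, 2815000, 431000),
  Median.Household.Income = c(48486, 76715, 56213, 45726, 71228, 68811),
  Avg.Home.Value = c(137200, 265200, 209600, 123300, 475900, 313600),
  Bachelors.Degree = c(3299958, 477727, 4633932, 1999307, 26218885, 3748592),
  Cluster = c(2, 2, 1, 2, 3, 1)
)

# Mean of every attribute within each cluster
cluster_summary <- aggregate(. ~ Cluster, data = preview[, -1], FUN = mean)
print(cluster_summary)

# One box per cluster for a single attribute
boxplot(Median.Household.Income ~ Cluster, data = preview,
        xlab = "Cluster", ylab = "Median household income")
```

Swapping the column on the left of the formula in `boxplot` gives the equivalent view for each of the other attributes.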

The clusters have distinct differences

Analysis of the State Groupings

Group 1 states are middle performers in income and home value. The distributions of college degree attainment, income, and population are all bimodal.

Group 2 states have lower incomes and fewer residents with a bachelor's degree. These states tend not to be located near the coasts (for example, the Midwest).

Group 3 contains the largest, most populous states – specifically, four states: Texas, California, New York, and Florida.

A spatial view

The Midwest makes up a large part of under-performing Group 2
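Without a map, the regional pattern can also be tabulated. The sketch below copies the cluster labels from the final clustering vector and pairs them with R's built-in `state.name` and `state.region` vectors. It assumes the data rows are alphabetical with the District of Columbia in position 9. That assumption is consistent with Group 3 being California, Florida, New York, and Texas, but the source does not state it. Note that `state.region` labels the Midwest as "North Central".

```r
# Cluster labels copied from the clustering vector printed with the final model
cluster_labels <- c(2, 2, 1, 2, 3, 1, 2, 2, 2, 3, 1, 2, 2, 1, 1, 2, 2, 2,
                    2, 2, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 3, 1, 2,
                    1, 2, 2, 1, 2, 2, 2, 1, 3, 2, 2, 1, 2, 1, 2, 1)

# Assumption: rows are alphabetical with the District of Columbia in
# position 9; drop it so labels line up with R's built-in 50-state vectors
state_clusters <- data.frame(State = state.name,
                             Region = state.region,
                             Cluster = cluster_labels[-9])

# Sanity check on the alignment: which states fall in Group 3?
group3_states <- as.character(
  state_clusters$State[state_clusters$Cluster == 3])
print(group3_states)

# Count of states per census region and cluster
region_by_cluster <- table(state_clusters$Region, state_clusters$Cluster)
print(region_by_cluster)
```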