Classifying U.S. States Based on Business Competitiveness: A Clustering Approach
Author
Saurabh C Srivastava
Published
January 30, 2025
Objective of the Analysis
The objective of this analysis is to perform unsupervised clustering on CNBC’s dataset of top states for doing business in 2024. Using K-Means clustering, the goal is to identify groups of states based on numerical factors that influence business rankings. By determining an optimal number of clusters, we can classify states into meaningful groups that share similar business conditions, allowing for insights into economic competitiveness and business-friendliness across different states.
Takeaways
This clustering analysis provides a data-driven approach to understanding the business landscape across U.S. states. By segmenting states into clusters based on CNBC’s 2024 rankings, we can identify opportunities for investment, economic growth, and policy improvements.
Explanation of the Code
This R script performs the following data preprocessing, clustering, and visualization steps:
1. Load Required Libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(cluster)
library(factoextra)
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Keep only numeric columns, since NbClust requires a numeric data frame
cnbc_clustering <- cnbc %>%
  select(-overall) %>%
  column_to_rownames(var = "state")
head(cnbc_clustering)
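The step that loads `cnbc` does not appear here, so the following is only a minimal sketch of the same preparation on a toy data frame. The column names `workforce` and `infrastructure` and every value are hypothetical placeholders, not CNBC data, and base R stands in for the tidyverse helpers.

```r
# Toy stand-in for the cnbc data frame; every value here is made up.
toy <- data.frame(
  state          = c("A", "B", "C", "D"),
  overall        = c(1, 2, 3, 4),
  workforce      = c(10, 4, 7, 20),   # hypothetical category rank
  infrastructure = c(3, 12, 9, 15),   # hypothetical category rank
  stringsAsFactors = FALSE
)

# Drop the overall rank and move the state labels into the row names,
# mirroring select(-overall) %>% column_to_rownames(var = "state").
keep <- setdiff(names(toy), c("state", "overall"))
toy_clustering <- toy[, keep, drop = FALSE]
rownames(toy_clustering) <- toy$state

# Every remaining column should be numeric before clustering.
all_numeric <- all(vapply(toy_clustering, is.numeric, logical(1)))
print(all_numeric)
head(toy_clustering)
```

Checking `all_numeric` before clustering catches character or factor columns early, before they reach a distance computation.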
4. Checking if the Data is Clusterable (Hopkins Statistic)
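The Hopkins statistic checks whether a dataset has a clustering tendency at all.
In recent factoextra releases, a value close to 1 means the data is highly clusterable, while a value near 0.5 means the data is indistinguishable from uniformly random points. Older releases reported the complement, so that values close to 0 indicated clusterable data.
Under either convention, 0.583 is close to 0.5, so the clustering tendency here is weak at best.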
# Use the Hopkins statistic to see whether the data is "clusterable"
hopkins_stat <- get_clust_tendency(cnbc_clustering, n = nrow(cnbc_clustering) - 1, graph = FALSE)
print(hopkins_stat)
$hopkins_stat
[1] 0.5832097
$plot
NULL
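To make the statistic less of a black box, here is a minimal base R sketch of one common formulation of the Hopkins statistic. It is not factoextra's implementation: the function name `hopkins_sketch`, its arguments, and the simulated data are all assumptions made for illustration. It is written so that values near 1 suggest clusterable data and values near 0.5 suggest uniform noise.

```r
# Hopkins statistic sketch: compares nearest-neighbour distances of
# artificial uniform points against those of real observations.
hopkins_sketch <- function(x, m = floor(nrow(x) / 10), seed = 123) {
  set.seed(seed)
  x <- as.matrix(x)
  n <- nrow(x)

  # m real observations, sampled without replacement
  idx <- sample.int(n, m)

  # m artificial points drawn uniformly from the bounding box of the data
  u <- apply(x, 2, function(col) runif(m, min(col), max(col)))
  u <- matrix(u, nrow = m)  # keep a matrix shape even when m == 1

  # Distance from each artificial point to its nearest real observation
  u_dist <- vapply(seq_len(m), function(i) {
    min(sqrt(rowSums(sweep(x, 2, u[i, ])^2)))
  }, numeric(1))

  # Distance from each sampled observation to its nearest *other* observation
  w_dist <- vapply(idx, function(i) {
    d <- sqrt(rowSums(sweep(x, 2, x[i, ])^2))
    min(d[-i])
  }, numeric(1))

  sum(u_dist) / (sum(u_dist) + sum(w_dist))
}

# Simulated data: two tight, well-separated blobs versus uniform noise
set.seed(1)
clustered <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.1), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.1), ncol = 2)
)
uniform <- matrix(runif(200), ncol = 2)

h_clustered <- hopkins_sketch(clustered, m = 20)
h_uniform   <- hopkins_sketch(uniform, m = 20)
print(c(clustered = h_clustered, uniform = h_uniform))
```

When the data are tightly grouped, real observations sit much closer to their neighbours than uniform points do, which pushes the ratio toward 1.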
5. Determining the Optimal Number of Clusters
The NbClust package determines the optimal number of clusters by evaluating a range of clustering indices. Run with all indices, the majority rule selects 3 clusters.
A second run restricted to the Hartigan index also selects 3.
That value is stored in n_clusters for the K-Means step.
# Determine number of clusters
NbClust::NbClust(cnbc_clustering, distance = "euclidean", method = "complete", index = "all")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 3 proposed 2 as the best number of clusters
* 13 proposed 3 as the best number of clusters
* 1 proposed 4 as the best number of clusters
* 1 proposed 12 as the best number of clusters
* 5 proposed 15 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 3
*******************************************************************
# NbClust is not attached with library(), so call it through its namespace
nb <- NbClust::NbClust(cnbc_clustering, method = "complete", index = "hartigan")
names(nb)
[1] "All.index" "Best.nc" "Best.partition"
nb$Best.nc # Number of clusters identified
Number_clusters Value_Index
3.0000 7.9851
n_clusters <- nb$Best.nc[1]
nb$Best.partition
Virginia North Carolina Texas Georgia Florida
1 1 1 1 1
Minnesota Ohio Tennessee Michigan Washington
2 1 1 1 2
Indiana Arizona Utah Iowa Illinois
1 1 1 1 2
Colorado Pennsylvania Missouri South Carolina Alabama
2 2 1 1 1
Wisconsin New York California Nebraska New Jersey
1 2 2 3 2
Oklahoma Kentucky Oregon Kansas Wyoming
1 3 2 3 3
Maryland Connecticut South Dakota Delaware North Dakota
2 2 3 1 3
Idaho Vermont Massachusetts Nevada West Virginia
3 3 2 1 3
New Hampshire Maine New Mexico Rhode Island Arkansas
3 3 3 3 3
Montana Louisiana Alaska Mississippi Hawaii
3 3 3 3 3
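For readers curious about what the Hartigan index measures, the sketch below computes it in base R as (W_k / W_{k+1} - 1) * (n - k - 1), where W_k is the total within-cluster sum of squares for k clusters. The helper names `within_ss` and `hartigan_sketch` are assumptions, the built-in USArrests data stands in for the CNBC data, and NbClust applies its own rule for picking the best k from these values, which this sketch does not reproduce.

```r
# Total within-cluster sum of squares for a given partition
within_ss <- function(x, labels) {
  groups <- split(as.data.frame(x), labels)
  sum(vapply(groups, function(g) {
    g <- as.matrix(g)
    sum(sweep(g, 2, colMeans(g))^2)
  }, numeric(1)))
}

# Hartigan index for k = 1..k_max on a complete-linkage hierarchy
hartigan_sketch <- function(x, k_max = 8) {
  x <- as.matrix(x)
  n <- nrow(x)
  tree <- hclust(dist(x, method = "euclidean"), method = "complete")

  # W_k for k = 1..(k_max + 1), cutting the same tree at each level
  w <- vapply(seq_len(k_max + 1), function(k) {
    within_ss(x, cutree(tree, k = k))
  }, numeric(1))

  k <- seq_len(k_max)
  data.frame(k = k, hartigan = (w[k] / w[k + 1] - 1) * (n - k - 1))
}

# Stand-in data: built-in state-level USArrests, standardised
res <- hartigan_sketch(scale(USArrests), k_max = 8)
print(res)
```

A large value at a given k means that adding one more cluster still reduces the within-cluster variation substantially.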
6. Applying K-Means Clustering
# Use k-means to partition the data points into a predefined number of clusters
set.seed(42)
k <- stats::kmeans(cnbc_clustering, centers = n_clusters)
names(k)
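The call above uses a single random start on the data as prepared. As a hedged illustration of two commonly used options, the sketch below standardises the variables and asks for several random starts through `nstart`. It runs on the built-in USArrests data as a stand-in, and whether scaling suits the CNBC columns depends on how they are measured, which is not shown here.

```r
# Standardise so that no single variable dominates the Euclidean distances
x <- scale(USArrests)

set.seed(42)
# nstart = 25 tries 25 random initialisations and keeps the best one
km <- kmeans(x, centers = 3, nstart = 25)

km$size            # number of states in each cluster
km$centers         # cluster centroids, in standardised units
head(km$cluster)   # cluster label for the first few states
km$tot.withinss    # total within-cluster sum of squares
```

Multiple starts matter because K-Means converges to a local optimum that depends on the initial centroids.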
ggplot(data = cnbc_clustering,
       aes(x = cluster, y = reorder(overall, desc(overall)),
           label = rownames(cnbc_clustering), col = as.factor(cluster))) +
  geom_text(vjust = 1) +
  geom_point(aes(size = cluster), alpha = 0.2) +
  geom_point(aes(size = overall), alpha = 0.2) +
  geom_jitter() +
  labs(y = "Overall Ranking on CNBC Top States for Doing Business",
       x = "Groupings",
       title = "Top State for Doing Business 2024, by cluster groups",
       caption = "Saurabh's Work") +
  theme(legend.position = "none")
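The plot call refers to `cluster` and `overall`, neither of which is a column of `cnbc_clustering` as built earlier, so a step that joins them back appears to be missing from this excerpt. Below is a sketch of one way to assemble such a plotting data frame. It uses USArrests as a stand-in, and the `overall` column is an invented ranking by urban population share that has nothing to do with CNBC's ranking.

```r
x <- scale(USArrests)
set.seed(42)
km <- kmeans(x, centers = 3, nstart = 25)

# Stand-in ranking, purely illustrative: 1 = highest urban population share
plot_df <- data.frame(
  state   = rownames(USArrests),
  overall = rank(-USArrests$UrbanPop, ties.method = "first"),
  cluster = factor(km$cluster)
)

# Draw the plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(plot_df, aes(x = cluster, y = overall, label = state, colour = cluster)) +
    geom_text(size = 3,
              position = position_jitter(width = 0.3, height = 0, seed = 42)) +
    scale_y_reverse() +
    labs(x = "Cluster", y = "Stand-in ranking (1 = top)") +
    theme(legend.position = "none")
  print(p)
}
```

Keeping the labels, ranking, and cluster assignment in one data frame avoids relying on row order matching between separate objects.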