Classifying U.S. States Based on Business Competitiveness: A Clustering Approach

Objective of the Analysis

The objective of this analysis is to apply unsupervised clustering to CNBC’s 2024 dataset of top states for doing business. Using K-Means clustering, the goal is to group states by their ranks in the ten scoring categories (infrastructure, workforce, economy, and so on) that feed the overall ranking. After choosing the number of clusters with cluster-validity indices, we classify states into groups that share similar business conditions, which gives a view of economic competitiveness and business-friendliness across states.

Takeaways

This clustering analysis provides a data-driven approach to understanding the business landscape across U.S. states. By segmenting states into clusters based on CNBC’s 2024 rankings, we can identify opportunities for investment, economic growth, and policy improvements.

Explanation of the Code

This R script performs the following data preprocessing, clustering, and visualization steps:

1. Load Required Libraries

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(cluster)
library(factoextra)

Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(NbClust)
library(ggplot2)

2. Load and Inspect Data

cnbc = read.csv("cnbc_data_2024.csv", header = TRUE)
sum(is.na(cnbc))

[1] 0

str(cnbc)

'data.frame':   50 obs. of  12 variables:
 $ overall               : int  1 2 3 4 5 6 7 8 9 10 ...
 $ state                 : chr  "Virginia" "North Carolina" "Texas" "Georgia" ...
 $ infra_structure       : int  3 20 26 1 35 5 13 7 24 18 ...
 $ workforce             : int  9 3 1 4 2 19 38 8 18 5 ...
 $ economy               : int  11 4 2 7 1 24 26 3 38 18 ...
 $ quality_of_life       : int  19 32 50 40 38 4 21 45 24 5 ...
 $ cost_of_doing_business: int  24 18 6 23 25 35 2 7 12 39 ...
 $ technology_innovation : int  15 11 1 18 16 12 12 28 9 4 ...
 $ business_friendliness : int  5 2 17 19 28 22 24 19 8 42 ...
 $ education             : int  1 10 35 8 37 17 33 20 41 36 ...
 $ access_to_capital     : int  8 10 6 12 2 12 10 20 15 20 ...
 $ cost_of_living        : int  19 31 37 18 42 19 4 24 3 42 ...

3. Data Cleaning

# Drop the overall rank and move state names to row names so that only the
# numeric category ranks remain (NbClust requires numeric input)
cnbc_clustering <- cnbc %>% 
  select(-overall) %>% 
  column_to_rownames(var = "state")
head(cnbc_clustering)

               infra_structure workforce economy quality_of_life
Virginia                     3         9      11              19
North Carolina              20         3       4              32
Texas                       26         1       2              50
Georgia                      1         4       7              40
Florida                     35         2       1              38
Minnesota                    5        19      24               4
               cost_of_doing_business technology_innovation
Virginia                           24                    15
North Carolina                     18                    11
Texas                               6                     1
Georgia                            23                    18
Florida                            25                    16
Minnesota                          35                    12
               business_friendliness education access_to_capital cost_of_living
Virginia                           5         1                 8             19
North Carolina                     2        10                10             31
Texas                             17        35                 6             37
Georgia                           19         8                12             18
Florida                           28        37                 2             42
Minnesota                         22        17                12             19

4. Checking if the Data is Clusterable (Hopkins Statistic)

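The Hopkins statistic measures whether a dataset has a clustering tendency, as opposed to being spread roughly uniformly.
A value near 0.5 means the data is close to uniformly distributed, so any clusters found will be weakly separated. Which end of the scale means "highly clusterable" depends on the factoextra version: recent releases report values near 1 for highly clusterable data, while older releases reported the complement, so values near 0 meant the same thing.
In this case, 0.583 is close to 0.5 under either convention, which suggests only a weak clustering tendency. The groups found below are therefore best read as a rough segmentation rather than sharply separated clusters.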

# Compute the Hopkins statistic to assess whether the data is clusterable
hopkins_stat <- get_clust_tendency(cnbc_clustering, n = nrow(cnbc_clustering)-1, graph = FALSE)
print(hopkins_stat)

$hopkins_stat
[1] 0.5832097

$plot
NULL
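
To make the number above easier to interpret, here is a minimal base-R sketch of one common formulation of the Hopkins statistic. It is not the factoextra implementation: hopkins_sketch() is a hypothetical helper, the two toy datasets are synthetic, and under this particular formulation values near 1 indicate clustered data while values near 0.5 indicate roughly uniform data.

```r
# Hopkins statistic sketch (base R only). hopkins_sketch() is a hypothetical
# helper for illustration; it is NOT what get_clust_tendency() runs internally.
hopkins_sketch <- function(x, m = nrow(x) - 1, seed = 123) {
  set.seed(seed)
  x <- as.matrix(x)
  n <- nrow(x)
  d <- ncol(x)

  # m real observations sampled without replacement
  idx <- sample(n, m)

  # m artificial points drawn uniformly inside the bounding box of the data
  lo <- apply(x, 2, min)
  hi <- apply(x, 2, max)
  unif <- matrix(runif(m * d, min = rep(lo, each = m), max = rep(hi, each = m)),
                 nrow = m)

  # Euclidean distance from point p to its nearest row of 'others'
  nearest <- function(p, others) {
    min(sqrt(rowSums(sweep(others, 2, p)^2)))
  }

  # u: artificial point -> nearest real observation
  u <- sapply(seq_len(m), function(i) nearest(unif[i, ], x))
  # w: sampled real observation -> nearest *other* real observation
  w <- sapply(idx, function(i) nearest(x[i, ], x[-i, , drop = FALSE]))

  sum(u) / (sum(u) + sum(w))
}

# Synthetic data (assumption): two tight blobs versus uniform noise
set.seed(1)
clustered <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
                   matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2))
uniform <- matrix(runif(200), ncol = 2)

h_clustered <- hopkins_sketch(clustered, m = 20)
h_uniform   <- hopkins_sketch(uniform, m = 20)
print(c(clustered = h_clustered, uniform = h_uniform))
```

The blob data should score well above the uniform data, which is the contrast a value such as 0.583 needs to be judged against.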

5. Determining the Optimal Number of Clusters

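The NbClust package helps determine the number of clusters by computing a range of cluster-validity indices for each candidate number of clusters (2 to 15 by default).
With index = "all", 13 of the 23 indices that cast a vote favor 3 clusters, so the majority rule selects 3. Because method = "complete" is used, the indices are evaluated on complete-linkage hierarchical partitions, so $Best.partition is a hierarchical result and not the K-Means assignment computed later.
A second run restricted to the Hartigan index also returns 3, and that value is stored in n_clusters.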

# Determine number of clusters
NbClust::NbClust(cnbc_clustering, distance = "euclidean", method = "complete", 
                       index = "all")

*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot.

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 3 proposed 2 as the best number of clusters 
* 13 proposed 3 as the best number of clusters 
* 1 proposed 4 as the best number of clusters 
* 1 proposed 12 as the best number of clusters 
* 5 proposed 15 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  3 
 
 
*******************************************************************

$All.index
       KL      CH Hartigan    CCC    Scott      Marriot    TrCovW   TraceW
2  0.8232 13.8980  13.1643 6.3950 268.5146 5.962350e+37 130573805 80271.20
3  3.8116 15.1153   5.1792 5.9118 324.9251 4.341345e+37  62356383 62994.56
4  0.5906 12.6390   6.3661 5.6861 373.0928 2.945255e+37  50508311 56741.81
5  1.5082 12.1135   4.5877 5.7027 417.3685 1.898318e+37  33115117 49843.72
6  1.0849 11.3386   4.1512 5.3222 458.9493 1.190051e+37  28759360 45232.37
7  1.0009 10.7814   4.0363 5.0377 503.2775 6.674652e+36  22535780 41332.78
8  1.4320 10.4368   3.0472 4.9203 535.5541 4.571530e+36  18638692 37785.89
9  0.9786  9.9334   3.0131 4.6458 576.4018 2.556044e+36  16340060 35229.89
10 1.0580  9.5740   2.8390 4.4370 608.2294 1.669676e+36  13552624 32818.07
11 1.0475  9.2743   2.7053 4.2368 639.3993 1.083126e+36  11816099 30643.13
12 0.6628  9.0244   3.8266 4.0452 694.7637 4.259583e+35  11138391 28655.41
13 1.1421  9.1763   3.5040 4.2325 745.1927 1.823355e+35   9195041 26033.78
14 1.6138  9.2843   2.3688 4.3586 781.3074 1.026957e+35   7797278 23781.62
15 1.0616  9.0976   2.2512 4.1597 817.6268 5.701799e+34   7162207 22313.37
   Friedman   Rubin Cindex     DB Silhouette   Duda Pseudot2   Beale Ratkowsky
2   93.0911  5.2461 0.4363 1.7677     0.1890 0.6871  13.2077  2.9607    0.2555
3  101.4420  6.6848 0.4698 1.4980     0.2085 0.7505   5.6518  2.1116    0.3506
4  111.9496  7.4215 0.5349 1.5095     0.1748 0.7301   6.2846  2.3480    0.3288
5  121.5594  8.4486 0.4995 1.5325     0.1619 0.5972   5.3948  4.0311    0.3173
6  132.8519  9.3099 0.5045 1.4020     0.1706 0.7421   4.1712  2.1578    0.3042
7  150.2047 10.1883 0.4867 1.3998     0.1670 0.6848   4.6024  2.8137    0.2912
8  157.5283 11.1446 0.5245 1.3880     0.1620 0.4531   4.8283  6.4941    0.2806
9  175.1265 11.9532 0.5383 1.2943     0.1719 0.6811   3.2770  2.7547    0.2699
10 185.6921 12.8316 0.5595 1.3099     0.1748 0.4105   4.3077  7.2424    0.2606
11 200.8483 13.7424 0.5507 1.2501     0.1820 0.7397   2.4639  2.0712    0.2525
12 236.1464 14.6956 0.5354 1.3171     0.1832 0.6559   3.6722  3.0870    0.2449
13 262.4721 16.1755 0.5412 1.2934     0.1936 0.4218   5.4831  7.3747    0.2393
14 274.4999 17.7073 0.5183 1.1660     0.2031 0.3217   4.2161  9.4511    0.2341
15 293.8900 18.8725 0.5178 1.1288     0.2075 1.6200  -1.1481 -1.9302    0.2282
        Ball Ptbiserial   Frey McClain   Dunn Hubert SDindex  Dindex   SDbw
2  40135.598     0.4002 0.1011  0.7541 0.3002      0  0.0868 39.3622 0.7928
3  20998.187     0.5352 0.5497  1.4706 0.3660      0  0.0683 34.9271 0.6439
4  14185.453     0.5270 0.3270  1.8916 0.4201      0  0.0716 33.0463 0.5680
5   9968.743     0.5243 0.2556  2.7043 0.3750      0  0.0734 31.0230 0.5401
6   7538.728     0.5231 0.3991  3.0096 0.3889      0  0.0715 29.4673 0.4733
7   5904.683     0.4997 0.2128  3.8292 0.3972      0  0.0736 28.2843 0.4600
8   4723.237     0.4925 0.2586  4.4763 0.4400      0  0.0695 27.0449 0.4204
9   3914.433     0.4878 0.4052  4.7168 0.4543      0  0.0675 25.9930 0.3685
10  3281.807     0.4664 0.2208  5.4922 0.4789      0  0.0687 25.0446 0.3550
11  2785.739     0.4621 0.2452  5.7517 0.4800      0  0.0682 24.1071 0.3219
12  2387.951     0.4443 0.1759  6.7194 0.4899      0  0.0765 23.3532 0.3201
13  2002.599     0.4244 0.1211  8.1800 0.5258      0  0.0759 22.2675 0.3092
14  1698.687     0.4175 0.2180  8.9106 0.5312      0  0.0679 21.1853 0.2751
15  1487.558     0.4104 0.0944  9.3845 0.5381      0  0.0659 20.4161 0.2502

$All.CriticalValues
   CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
2          0.6899            13.0323       0.0015
3          0.6216            10.3482       0.0260
4          0.6216            10.3482       0.0127
5          0.5025             7.9198       0.0002
6          0.5697             9.0640       0.0249
7          0.5403             8.5077       0.0041
8          0.3763             6.6304       0.0000
9          0.4791             7.6122       0.0064
10         0.3228             6.2930       0.0000
11         0.4791             7.6122       0.0385
12         0.4791             7.6122       0.0026
13         0.3763             6.6304       0.0000
14         0.2504             5.9869       0.0000
15         0.3228             6.2930       1.0000

$Best.nc
                    KL      CH Hartigan   CCC   Scott      Marriot   TrCovW
Number_clusters 3.0000  3.0000   3.0000 2.000  3.0000 4.000000e+00        3
Value_Index     3.8116 15.1153   7.9851 6.395 56.4105 3.491532e+36 68217422
                  TraceW Friedman   Rubin Cindex      DB Silhouette   Duda
Number_clusters     3.00  12.0000  3.0000 2.0000 15.0000     3.0000 3.0000
Value_Index     11023.88  35.2981 -0.7021 0.4363  1.1288     0.2085 0.7505
                PseudoT2   Beale Ratkowsky     Ball PtBiserial Frey McClain
Number_clusters   3.0000 15.0000    3.0000     3.00     3.0000    1  2.0000
Value_Index       5.6518 -1.9302    0.3506 19137.41     0.5352   NA  0.7541
                   Dunn Hubert SDindex Dindex    SDbw
Number_clusters 15.0000      0 15.0000      0 15.0000
Value_Index      0.5381      0  0.0659      0  0.2502

$Best.partition
      Virginia North Carolina          Texas        Georgia        Florida 
             1              1              1              1              1 
     Minnesota           Ohio      Tennessee       Michigan     Washington 
             2              1              1              1              2 
       Indiana        Arizona           Utah           Iowa       Illinois 
             1              1              1              1              2 
      Colorado   Pennsylvania       Missouri South Carolina        Alabama 
             2              2              1              1              1 
     Wisconsin       New York     California       Nebraska     New Jersey 
             1              2              2              3              2 
      Oklahoma       Kentucky         Oregon         Kansas        Wyoming 
             1              3              2              3              3 
      Maryland    Connecticut   South Dakota       Delaware   North Dakota 
             2              2              3              1              3 
         Idaho        Vermont  Massachusetts         Nevada  West Virginia 
             3              3              2              1              3 
 New Hampshire          Maine     New Mexico   Rhode Island       Arkansas 
             3              3              3              3              3 
       Montana      Louisiana         Alaska    Mississippi         Hawaii 
             3              3              3              3              3
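
The majority-rule tally printed above can be reproduced from the Number_clusters row of $Best.nc. The sketch below copies that row by hand from the output; the range filter is an assumption chosen to match the printed summary, not NbClust's internal code.

```r
# Number_clusters row copied from the $Best.nc output above
best_k <- c(KL = 3, CH = 3, Hartigan = 3, CCC = 2, Scott = 3, Marriot = 4,
            TrCovW = 3, TraceW = 3, Friedman = 12, Rubin = 3, Cindex = 2,
            DB = 15, Silhouette = 3, Duda = 3, PseudoT2 = 3, Beale = 15,
            Ratkowsky = 3, Ball = 3, PtBiserial = 3, Frey = 1, McClain = 2,
            Dunn = 15, Hubert = 0, SDindex = 15, Dindex = 0, SDbw = 15)

# Keep only votes inside the tested range of 2 to 15 clusters (assumption).
# This drops the graphical Hubert and D indices (0) and Frey (1, value NA).
votes <- table(best_k[best_k >= 2 & best_k <= 15])
print(votes)

# The cluster count with the most votes
winner <- as.integer(names(votes)[which.max(votes)])
print(winner)
```

Of the 23 indices that vote, 13 pick 3 clusters; the next most popular choice, 15 clusters, gets only 5 votes.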

nb = NbClust(cnbc_clustering, method = "complete", index = 'hartigan')
names(nb)

[1] "All.index"      "Best.nc"        "Best.partition"

nb$Best.nc # Number of clusters identified

Number_clusters     Value_Index 
         3.0000          7.9851

n_clusters <- nb$Best.nc[1]
nb$Best.partition

      Virginia North Carolina          Texas        Georgia        Florida 
             1              1              1              1              1 
     Minnesota           Ohio      Tennessee       Michigan     Washington 
             2              1              1              1              2 
       Indiana        Arizona           Utah           Iowa       Illinois 
             1              1              1              1              2 
      Colorado   Pennsylvania       Missouri South Carolina        Alabama 
             2              2              1              1              1 
     Wisconsin       New York     California       Nebraska     New Jersey 
             1              2              2              3              2 
      Oklahoma       Kentucky         Oregon         Kansas        Wyoming 
             1              3              2              3              3 
      Maryland    Connecticut   South Dakota       Delaware   North Dakota 
             2              2              3              1              3 
         Idaho        Vermont  Massachusetts         Nevada  West Virginia 
             3              3              2              1              3 
 New Hampshire          Maine     New Mexico   Rhode Island       Arkansas 
             3              3              3              3              3 
       Montana      Louisiana         Alaska    Mississippi         Hawaii 
             3              3              3              3              3
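
The Value_Index of 7.9851 reported by nb$Best.nc is not the Hartigan index itself: it matches the largest drop between consecutive Hartigan values in the $All.index table. The sketch below checks that arithmetic with the Hartigan column copied by hand from the output above; reading the selection rule as "largest drop between successive levels" is an interpretation that agrees with these numbers.

```r
# Hartigan column of $All.index, for 2 to 15 clusters (copied from the output)
k <- 2:15
hartigan <- c(13.1643, 5.1792, 6.3661, 4.5877, 4.1512, 4.0363, 3.0472,
              3.0131, 2.8390, 2.7053, 3.8266, 3.5040, 2.3688, 2.2512)

# Drop in the index when moving from k - 1 to k clusters, for k = 3..15
drops <- hartigan[-length(hartigan)] - hartigan[-1]
names(drops) <- k[-1]

# The chosen k is where the drop is largest
best_k_hartigan <- as.integer(names(drops)[which.max(drops)])
print(best_k_hartigan)
print(max(drops))
```

The largest drop is 13.1643 - 5.1792 = 7.9851, reached when going from 2 to 3 clusters.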

6. Applying K-Means Clustering

# Use kmeans to partition the data points into a predefined number of clusters.
set.seed(42)
k = stats::kmeans(cnbc_clustering, centers = n_clusters)
      
names(k)

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"
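
The components listed by names(k) are related: the total sum of squares splits into a within-cluster part and a between-cluster part, and the share explained by the clusters is betweenss / totss. The sketch below illustrates this on synthetic data (the toy matrix is an assumption, not the CNBC data) and also uses nstart = 25, which is not set in the script above.

```r
# Synthetic data (assumption): three well-separated groups of 20 points each
set.seed(42)
toy <- rbind(matrix(rnorm(60, mean = 10, sd = 2), ncol = 3),
             matrix(rnorm(60, mean = 30, sd = 2), ncol = 3),
             matrix(rnorm(60, mean = 45, sd = 2), ncol = 3))

# nstart = 25 tries 25 random starts and keeps the best solution
fit <- stats::kmeans(toy, centers = 3, nstart = 25)

# Share of total variation captured by the clustering
explained <- fit$betweenss / fit$totss
print(explained)

# Recompute the within-cluster sum of squares by hand from centers and labels
manual_wss <- sum(sapply(1:3, function(j) {
  members <- toy[fit$cluster == j, , drop = FALSE]
  sum(sweep(members, 2, fit$centers[j, ])^2)
}))
print(c(manual = manual_wss, kmeans = fit$tot.withinss))
```

Running the same betweenss / totss ratio on the object k from the script would show how much of the variation in the category ranks the three clusters capture.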

cnbc_clustering$cluster = k$cluster
cnbc_clustering$overall = cnbc$overall
head(cnbc_clustering)

               infra_structure workforce economy quality_of_life
Virginia                     3         9      11              19
North Carolina              20         3       4              32
Texas                       26         1       2              50
Georgia                      1         4       7              40
Florida                     35         2       1              38
Minnesota                    5        19      24               4
               cost_of_doing_business technology_innovation
Virginia                           24                    15
North Carolina                     18                    11
Texas                               6                     1
Georgia                            23                    18
Florida                            25                    16
Minnesota                          35                    12
               business_friendliness education access_to_capital cost_of_living
Virginia                           5         1                 8             19
North Carolina                     2        10                10             31
Texas                             17        35                 6             37
Georgia                           19         8                12             18
Florida                           28        37                 2             42
Minnesota                         22        17                12             19
               cluster overall
Virginia             3       1
North Carolina       3       2
Texas                3       3
Georgia              3       4
Florida              3       5
Minnesota            2       6
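
Once cluster labels are attached, a quick way to characterize each group is the mean rank per category within each cluster (lower is better). The sketch below does this with aggregate() on just the six rows and four of the columns printed above, so the numbers are illustrative only; the data frame name six is hypothetical.

```r
# Six rows and four category columns copied from the head() output above
six <- data.frame(
  infra_structure = c(3, 20, 26, 1, 35, 5),
  workforce       = c(9, 3, 1, 4, 2, 19),
  economy         = c(11, 4, 2, 7, 1, 24),
  quality_of_life = c(19, 32, 50, 40, 38, 4),
  cluster         = c(3, 3, 3, 3, 3, 2),
  row.names = c("Virginia", "North Carolina", "Texas",
                "Georgia", "Florida", "Minnesota")
)

# Mean rank of every category within each cluster
profile <- aggregate(. ~ cluster, data = six, FUN = mean)
print(profile)
```

On the full cnbc_clustering data frame the same aggregate() call would give a per-cluster profile across all categories, which helps put a label on each group.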

7. Visualizing Clustering Results

# Plot each state by cluster (x) and overall CNBC rank (y); rank 1 is at the top
ggplot(data = cnbc_clustering,
       aes(x = as.factor(cluster), y = reorder(overall, desc(overall)),
           label = rownames(cnbc_clustering), col = as.factor(cluster))) +
  geom_point(size = 3, alpha = 0.2) +   # one marker per state
  geom_text(vjust = 1) +                # state name just below its marker
  labs(y = "Overall Ranking on CNBC Top States for Doing Business",
       x = "Groupings",
       title = "Top States for Doing Business 2024, by cluster groups",
       caption = "Saurabh's Work") +
  theme(legend.position = "none")       # "none" hides the legend