knitr::opts_chunk$set(echo = TRUE)

1. Problem Statement

This exercise focuses on the businesses that are struggling the most during this crisis, the small businesses. Small companies represent 99% of all businesses in the United States and nearly 50% of employment, making them an essential engine for economic development.

Every year, the U.S. Small Business Administration (SBA) publishes an annual report on small business activity in each state. For this exercise, we will use part of the data published by this public organization and attempt to draw conclusions using Clustering.

You can find the dataset and the data dictionary in the data folder.

First, let’s load the dataset into R:

# Load required packages
require(factoextra)
library(tidyverse)
library(dplyr)
# Load the dataset
df <- read.csv("/Users/arnauandrews/Desktop/UB-Master/M4 - Técnicas Avanzadas de Minería de Datos/Data/4.4_Cluster_ejercicio.csv", dec=".", sep=",", header=TRUE)

colnames(df)
##  [1] "State"                    "PercentSmallBiz"         
##  [3] "PercentEmplySmallBiz"     "AvgEmPerSmallBiz"        
##  [5] "MedIncomeSelfEmpCorp"     "MedIncomeSelfEmpUnincorp"
##  [7] "PercentExportSmallBiz"    "PercentExportValue"      
##  [9] "StateGDPGrowth"           "StateUnemployRate"       
## [11] "ShareManufactEmploy"      "AvgEmpPerWOB"            
## [13] "AvgEmpPerVOB"             "AvgEmpPerMOB"            
## [15] "ShareProSvcEmploy"        "GrowthNumPropiertors"
head(df)
##        State PercentSmallBiz PercentEmplySmallBiz AvgEmPerSmallBiz
## 1    Alabama       0.9936905            0.4746900              2.0
## 2     Alaska       0.9912795            0.5315244              2.0
## 3    Arizona       0.9942976            0.4449635              1.8
## 4   Arkansas       0.9927139            0.4782382              1.9
## 5 California       0.9984648            0.4883776              1.8
## 6   Colorado       0.9946615            0.4861072              1.8
##   MedIncomeSelfEmpCorp MedIncomeSelfEmpUnincorp PercentExportSmallBiz
## 1                48941                    20760                 0.804
## 2                60559                    30817                 0.710
## 3                48698                    20872                 0.874
## 4                45044                    21128                 0.781
## 5                57420                    25034                 0.957
## 6                49977                    24631                 0.875
##   PercentExportValue StateGDPGrowth StateUnemployRate ShareManufactEmploy
## 1              0.155           0.03              0.04                0.34
## 2              0.384           0.02              0.07                0.35
## 3              0.231           0.03              0.05                0.46
## 4              0.283           0.02              0.04                0.26
## 5              0.432           0.03              0.04                0.53
## 6              0.306           0.04              0.03                0.52
##   AvgEmpPerWOB AvgEmpPerVOB AvgEmpPerMOB ShareProSvcEmploy GrowthNumPropiertors
## 1         11.8         10.7          9.5              0.62                 0.02
## 2          5.6          6.5          6.4              0.59                 0.01
## 3          8.3         11.5          9.4              0.56                 0.03
## 4          8.2          7.8          7.7              0.71                 0.02
## 5          7.9         10.1          8.5              0.55                 0.03
## 6          6.1          8.6          7.6              0.57                 0.02

1.1 Dataset Preparation

As you can see, the dataset has 51 observations (corresponding to each of the federal states in the USA) and 16 variables (related to economic data for small businesses). If you need more information about the meaning of each variable, you can find it in the data dictionary.

You will notice that certain variables contain asterisks, which represent missing values (NA). Therefore, we need to handle these variables to apply clustering.

1. Replacing the missing values (asterisks) with the median and transform the treated variables into numeric type.

# Replace "*" with NA
df[df == "*"] <- NA

# Display columns with NA
colSums(is.na(df))
##                    State          PercentSmallBiz     PercentEmplySmallBiz 
##                        0                        0                        0 
##         AvgEmPerSmallBiz     MedIncomeSelfEmpCorp MedIncomeSelfEmpUnincorp 
##                        0                        0                        0 
##    PercentExportSmallBiz       PercentExportValue           StateGDPGrowth 
##                        0                        0                        0 
##        StateUnemployRate      ShareManufactEmploy             AvgEmpPerWOB 
##                        0                        0                        4 
##             AvgEmpPerVOB             AvgEmpPerMOB        ShareProSvcEmploy 
##                        6                        1                        1 
##     GrowthNumPropiertors 
##                        0
# Convert variables to numeric
df$AvgEmpPerWOB <- as.numeric(df$AvgEmpPerWOB)
df$AvgEmpPerVOB <- as.numeric(df$AvgEmpPerVOB)
df$AvgEmpPerMOB <- as.numeric(df$AvgEmpPerMOB)
df$ShareProSvcEmploy <- as.numeric(df$ShareProSvcEmploy)

# Replace missing values with median
df <- df %>% mutate(across(c(AvgEmpPerWOB, AvgEmpPerVOB, AvgEmpPerMOB, ShareProSvcEmploy), ~replace_na(., median(., na.rm=TRUE))))

head(df)
##        State PercentSmallBiz PercentEmplySmallBiz AvgEmPerSmallBiz
## 1    Alabama       0.9936905            0.4746900              2.0
## 2     Alaska       0.9912795            0.5315244              2.0
## 3    Arizona       0.9942976            0.4449635              1.8
## 4   Arkansas       0.9927139            0.4782382              1.9
## 5 California       0.9984648            0.4883776              1.8
## 6   Colorado       0.9946615            0.4861072              1.8
##   MedIncomeSelfEmpCorp MedIncomeSelfEmpUnincorp PercentExportSmallBiz
## 1                48941                    20760                 0.804
## 2                60559                    30817                 0.710
## 3                48698                    20872                 0.874
## 4                45044                    21128                 0.781
## 5                57420                    25034                 0.957
## 6                49977                    24631                 0.875
##   PercentExportValue StateGDPGrowth StateUnemployRate ShareManufactEmploy
## 1              0.155           0.03              0.04                0.34
## 2              0.384           0.02              0.07                0.35
## 3              0.231           0.03              0.05                0.46
## 4              0.283           0.02              0.04                0.26
## 5              0.432           0.03              0.04                0.53
## 6              0.306           0.04              0.03                0.52
##   AvgEmpPerWOB AvgEmpPerVOB AvgEmpPerMOB ShareProSvcEmploy GrowthNumPropiertors
## 1         11.8         10.7          9.5              0.62                 0.02
## 2          5.6          6.5          6.4              0.59                 0.01
## 3          8.3         11.5          9.4              0.56                 0.03
## 4          8.2          7.8          7.7              0.71                 0.02
## 5          7.9         10.1          8.5              0.55                 0.03
## 6          6.1          8.6          7.6              0.57                 0.02

Once the missing value imputation is done, it is necessary to observe the range of variables that will be used for clustering and determine whether they need to be transformed or not.

2. Creating a new dataframe with all standardized variables, where each row is named after a state.

# Add the "States" variable as the row name in the new dataframe
df_copy <- as.data.frame(df)
rownames(df_copy) <- df_copy$State
df_copy <- df_copy[,-1]

# Scale the variables by creating a new dataframe
df_copy <- scale(df_copy)

# Summarize the scaled variables
summary(df_copy)
##  PercentSmallBiz   PercentEmplySmallBiz AvgEmPerSmallBiz   MedIncomeSelfEmpCorp
##  Min.   :-3.3817   Min.   :-1.4818      Min.   :-2.45769   Min.   :-1.67544    
##  1st Qu.:-0.4950   1st Qu.:-0.6590      1st Qu.:-0.65421   1st Qu.:-0.67866    
##  Median : 0.1872   Median :-0.1937      Median :-0.05304   Median :-0.03815    
##  Mean   : 0.0000   Mean   : 0.0000      Mean   : 0.00000   Mean   : 0.00000    
##  3rd Qu.: 0.6152   3rd Qu.: 0.4474      3rd Qu.: 0.54812   3rd Qu.: 0.39076    
##  Max.   : 1.6142   Max.   : 3.0440      Max.   : 3.55392   Max.   : 2.99864    
##  MedIncomeSelfEmpUnincorp PercentExportSmallBiz PercentExportValue
##  Min.   :-1.3557          Min.   :-2.59676      Min.   :-1.5062   
##  1st Qu.:-0.6866          1st Qu.:-0.41868      1st Qu.:-0.6043   
##  Median :-0.3756          Median : 0.03196      Median :-0.3036   
##  Mean   : 0.0000          Mean   : 0.00000      Mean   : 0.0000   
##  3rd Qu.: 0.3869          3rd Qu.: 0.63281      3rd Qu.: 0.3073   
##  Max.   : 3.1839          Max.   : 2.04105      Max.   : 2.9678   
##  StateGDPGrowth     StateUnemployRate ShareManufactEmploy  AvgEmpPerWOB    
##  Min.   :-1.96232   Min.   :-2.1410   Min.   :-1.692200   Min.   :-1.8917  
##  1st Qu.:-0.96154   1st Qu.:-1.1483   1st Qu.:-0.842769   1st Qu.:-0.4773  
##  Median : 0.03925   Median :-0.1557   Median : 0.006662   Median : 0.1182  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.000000   Mean   : 0.0000  
##  3rd Qu.: 1.04003   3rd Qu.: 0.8369   3rd Qu.: 0.516321   3rd Qu.: 0.4160  
##  Max.   : 3.04160   Max.   : 2.8222   Max.   : 2.809785   Max.   : 2.7237  
##   AvgEmpPerVOB       AvgEmpPerMOB      ShareProSvcEmploy  GrowthNumPropiertors
##  Min.   :-1.39697   Min.   :-2.01711   Min.   :-3.40937   Min.   :-2.4626     
##  1st Qu.:-0.72689   1st Qu.:-0.78221   1st Qu.:-0.68417   1st Qu.:-0.3694     
##  Median :-0.09304   Median :-0.03053   Median :-0.05153   Median :-0.3694     
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000     
##  3rd Qu.: 0.37783   3rd Qu.: 0.58692   3rd Qu.: 0.53245   3rd Qu.:-0.3694     
##  Max.   : 4.86914   Max.   : 2.06343   Max.   : 2.72235   Max.   : 1.7238

1.2 Cluster Creation

Now that we have prepared the dataframe, we need to estimate the optimal number of clusters. Although it can be a subjective process, there are some methods that can help us make the decision.

3. Choosing the optimal number of clusters using the elbow method. Any other numbers that could also be optimal?

# Elbow method visualization
fviz_nbclust(x = df_copy, FUNcluster = kmeans, method = "wss", k.max = 15,
             diss = get_dist(df_copy, method = "euclidean"), nstart = 50)

print("According to the elbow method, I believe the optimal number of clusters is 4. However, it can be observed that the curve does not have a well-defined elbow point, where the relationship between the number of clusters and the sum of squared distances between data points and their assigned centroids starts to level off.")
## [1] "According to the elbow method, I believe the optimal number of clusters is 4. However, it can be observed that the curve does not have a well-defined elbow point, where the relationship between the number of clusters and the sum of squared distances between data points and their assigned centroids starts to level off."
print("For this reason, I think values between 4 and 8 clusters could be quite optimal.")
## [1] "For this reason, I think values between 4 and 8 clusters could be quite optimal."

4. Choosing the optimal number of clusters using the dendrogram representation. Do you observe any other numbers that could also be optimal? Justify your answer.

# Dendrogram visualization
set.seed(12345)
hc_complete <- hclust(d = dist(x = df_copy, method = "euclidean"),
                      method = "complete")

fviz_dend(x = hc_complete, cex = 0.5, main = "Complete Linkage",
          sub = "Euclidean Distance") +
  theme(plot.title = element_text(hjust = 0.5, size = 15))
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

print("According to the dendrogram, I observe that the optimal number of clusters is at least 3, as the District of Columbia is clearly separated from the other two major subgroups and has the longest branch, indicating that it differs significantly from the rest of the states.")
## [1] "According to the dendrogram, I observe that the optimal number of clusters is at least 3, as the District of Columbia is clearly separated from the other two major subgroups and has the longest branch, indicating that it differs significantly from the rest of the states."
print("However, I consider that another optimal number of clusters according to the dendrogram (the ideal point where the clusters explain the underlying nature and patterns of the data) is between 4 and 8.")
## [1] "However, I consider that another optimal number of clusters according to the dendrogram (the ideal point where the clusters explain the underlying nature and patterns of the data) is between 4 and 8."

Once the decision on the number of clusters is made, we proceed to create and visualize the clusters in two dimensions.

5. Create the chosen number of clusters and visualize them in two dimensions using the fviz_cluster function.

# Creacion de los clusters
set.seed(12345)
km_clusters <- kmeans(x = df_copy, centers = 4, nstart = 50)

# Visualización de los clusters en 2 dimensiones
fviz_cluster(object = km_clusters, data = df_copy, show.clust.cent = TRUE,
             elipse.type = "euclid", star.plot = TRUE, repel = TRUE) +
  labs(title = "Resultados clustering k-means") +
  theme_bw() +
  theme(legend.position = "none")

Cluster Interpretation

Now that we have the two-dimensional visualization of the states belonging to each created cluster, let’s move on to the interesting part, the interpretation.

Enter your answer here

print(“Illinois, Maine, South Carolina, and District of Columbia are the most representative states of each cluster because they are the closest to each of the respective centroids of the four clusters.”) Choosing which state you consider most representative of each cluster.

# Calculating the distance of each state to its centroid
distances <- sqrt(rowSums((df_copy - fitted(km_clusters))^ 2))
distance <- as.data.frame(distances)

# Creating a classification of each state with its cluster
cluster_cat <-  as.data.frame(km_clusters[1])

# Creating the state variable in the dataframe with states grouped by cluster and distances
cluster_cat <- rownames_to_column(cluster_cat, var = "State")
distance <- rownames_to_column(distance, var = "State")

# Merging the two dataframes
distance <- inner_join(distance, cluster_cat, by = "State")


# Checking the state of each cluster with the lowest distance to its centroid
distance %>% select("State", "distances", "cluster") %>% arrange(distances)
##                   State distances cluster
## 1  District of Columbia  0.000000       4
## 2              Illinois  1.615089       1
## 3        South Carolina  1.677608       3
## 4                Kansas  1.831775       3
## 5              Virginia  1.904677       3
## 6              Maryland  2.034010       1
## 7              Oklahoma  2.073959       3
## 8               Indiana  2.075422       3
## 9                  Iowa  2.172062       3
## 10             Kentucky  2.360737       3
## 11             Colorado  2.379225       1
## 12       North Carolina  2.414516       3
## 13         Pennsylvania  2.416780       1
## 14                Maine  2.430999       2
## 15              Arizona  2.432256       1
## 16           Washington  2.476953       1
## 17             Missouri  2.515505       1
## 18              Alabama  2.516117       3
## 19            Tennessee  2.557161       3
## 20            Minnesota  2.574178       3
## 21        West Virginia  2.610625       3
## 22           California  2.625364       1
## 23                 Ohio  2.647712       3
## 24             Arkansas  2.650440       3
## 25            Wisconsin  2.758175       3
## 26               Oregon  2.785385       1
## 27             Michigan  2.787209       3
## 28        Massachusetts  2.816631       1
## 29                Idaho  2.827351       2
## 30                 Utah  2.834480       1
## 31        New Hampshire  2.850941       2
## 32              Georgia  2.864210       1
## 33            Louisiana  2.971267       3
## 34             New York  2.991384       1
## 35              Vermont  3.136421       2
## 36                Texas  3.175450       1
## 37          Mississippi  3.230139       3
## 38          Connecticut  3.284822       1
## 39              Wyoming  3.407876       2
## 40           New Mexico  3.484061       3
## 41         Rhode Island  3.495238       2
## 42           New Jersey  3.516406       1
## 43              Montana  3.641645       2
## 44         South Dakota  3.886305       2
## 45         North Dakota  3.996394       2
## 46             Nebraska  4.571937       3
## 47              Florida  4.767534       1
## 48             Delaware  4.956830       1
## 49               Hawaii  4.996611       2
## 50               Nevada  5.204733       3
## 51               Alaska  5.692296       2
print("Illinois, Maine, South Carolina, and District of Columbia are the most representative states of each cluster because they are the closest to each of the respective centroids of the four clusters.")
## [1] "Illinois, Maine, South Carolina, and District of Columbia are the most representative states of each cluster because they are the closest to each of the respective centroids of the four clusters."

7. Indexing each of the most representative states of each cluster in the initial dataframe and create a new dataset that only includes these states. Which variables differ the most in value between states?

# Selecting the states to be included in a vector
states <- c("Illinois", "District of Columbia", "South Carolina", "Maine")

# Creating a new dataframe from the initial dataframe with only the selected states
df_select <- subset(df, rownames(df_copy) %in% states)
df_select_stand <- subset(df_copy, rownames(df_copy) %in% states)
df_select_stand <- as.data.frame(df_select_stand)

# Creating a new dataframe with the difference between the maximum and minimum value of each variable
df_diff <- data.frame(apply(df_select_stand, 2, function(x) max(x) - min(x)))
df_diff %>% arrange(desc(apply.df_select_stand..2..function.x..max.x....min.x..))
##                          apply.df_select_stand..2..function.x..max.x....min.x..
## PercentExportValue                                                    4.4739257
## PercentSmallBiz                                                       4.3093460
## MedIncomeSelfEmpCorp                                                  4.2508722
## AvgEmPerSmallBiz                                                      4.2081271
## ShareManufactEmploy                                                   3.7374973
## MedIncomeSelfEmpUnincorp                                              3.4718593
## ShareProSvcEmploy                                                     3.0171961
## StateUnemployRate                                                     2.9778938
## PercentExportSmallBiz                                                 2.4221785
## PercentEmplySmallBiz                                                  2.2155189
## GrowthNumPropiertors                                                  2.0931897
## StateGDPGrowth                                                        2.0015705
## AvgEmpPerMOB                                                          1.8791958
## AvgEmpPerWOB                                                          1.4144210
## AvgEmpPerVOB                                                          0.8330653
# Answer
print("The variables that differ the most between states are: PercentExportValue, PercentSmallBiz, MedIncomeSelfEmpCorp, and AvgEmPerSmallBiz.")
## [1] "The variables that differ the most between states are: PercentExportValue, PercentSmallBiz, MedIncomeSelfEmpCorp, and AvgEmPerSmallBiz."

8. Which of the states selected in the previous exercise has the highest unemployment rate?

# Viewing the StateUnemployRate variable for the selected states
df_select %>% select(State, StateUnemployRate)
##                   State StateUnemployRate
## 9  District of Columbia              0.06
## 14             Illinois              0.05
## 20                Maine              0.03
## 41       South Carolina              0.04
# Answer here
print("The state with the highest unemployment rate is District of Columbia.")
## [1] "The state with the highest unemployment rate is District of Columbia."

9. If the President of the United States wanted to launch a campaign to increase the volume of exports from small businesses to foreign markets in the states that export the least and with lower value, which cluster(s) would you propose? (Mention only the most representative state of the cluster)

# Crear dataframes con el cluster al que pertenece cada estado y la media de cada variable por # Create dataframes with the cluster to which each state belongs and the mean of each variable by cluster
clusters_mean <- as.data.frame(km_clusters[2])

# Add the cluster variable to the original dataframe
df <- inner_join(df, cluster_cat, by = "State")

# To identify the states that export the least and with lower value, identify the states with the lowest "PercentExportValue" which indicates the percentage of export value by small businesses.

df %>% select("State", "cluster", "PercentExportValue") %>% filter(cluster == 3) %>% arrange(PercentExportValue)
##             State cluster PercentExportValue
## 1  South Carolina       3             0.1030
## 2         Alabama       3             0.1550
## 3       Tennessee       3             0.1580
## 4     Mississippi       3             0.1630
## 5          Nevada       3             0.1700
## 6         Indiana       3             0.1740
## 7            Iowa       3             0.1890
## 8          Kansas       3             0.2210
## 9            Ohio       3             0.2240
## 10       Michigan       3             0.2390
## 11       Oklahoma       3             0.2460
## 12  West Virginia       3             0.2540
## 13       Nebraska       3             0.2550
## 14 North Carolina       3             0.2610
## 15       Virginia       3             0.2660
## 16      Wisconsin       3             0.2741
## 17       Arkansas       3             0.2830
## 18      Minnesota       3             0.2940
## 19      Louisiana       3             0.3270
## 20       Kentucky       3             0.3390
## 21     New Mexico       3             0.4320
print("I would propose cluster number 3, as it contains the states with the lowest value of the PercentExportValue variable. In this regard, the state I consider most representative is South Carolina, as it has the lowest PercentExportValue in cluster number 3.")
## [1] "I would propose cluster number 3, as it contains the states with the lowest value of the PercentExportValue variable. In this regard, the state I consider most representative is South Carolina, as it has the lowest PercentExportValue in cluster number 3."

10. Which cluster best represents small businesses with higher income?

# To identify the clusters with states that best represent small businesses with higher income, identify the states with the highest "PercentExportValue" which indicates the percentage of export value by small businesses. 

df %>% select(State, cluster, PercentExportValue) %>% arrange(desc(PercentExportValue))
##                   State cluster PercentExportValue
## 1  District of Columbia       4             0.7950
## 2                Hawaii       2             0.6940
## 3               Montana       2             0.6560
## 4               Wyoming       2             0.6301
## 5          Rhode Island       2             0.6160
## 6               Florida       1             0.5890
## 7              New York       1             0.5700
## 8                 Maine       2             0.5120
## 9                  Utah       1             0.4890
## 10           California       1             0.4320
## 11           New Mexico       3             0.4320
## 12           New Jersey       1             0.4310
## 13               Alaska       2             0.3840
## 14        New Hampshire       2             0.3830
## 15                Texas       1             0.3760
## 16         South Dakota       2             0.3620
## 17              Vermont       2             0.3490
## 18        Massachusetts       1             0.3470
## 19         North Dakota       2             0.3470
## 20             Kentucky       3             0.3390
## 21            Louisiana       3             0.3270
## 22         Pennsylvania       1             0.3100
## 23             Colorado       1             0.3060
## 24             Maryland       1             0.2960
## 25            Minnesota       3             0.2940
## 26              Georgia       1             0.2890
## 27             Arkansas       3             0.2830
## 28            Wisconsin       3             0.2741
## 29               Oregon       1             0.2700
## 30                Idaho       2             0.2680
## 31             Missouri       1             0.2680
## 32             Virginia       3             0.2660
## 33          Connecticut       1             0.2640
## 34       North Carolina       3             0.2610
## 35             Illinois       1             0.2590
## 36             Nebraska       3             0.2550
## 37        West Virginia       3             0.2540
## 38             Oklahoma       3             0.2460
## 39             Michigan       3             0.2390
## 40              Arizona       1             0.2310
## 41                 Ohio       3             0.2240
## 42               Kansas       3             0.2210
## 43           Washington       1             0.1930
## 44             Delaware       1             0.1910
## 45                 Iowa       3             0.1890
## 46              Indiana       3             0.1740
## 47               Nevada       3             0.1700
## 48          Mississippi       3             0.1630
## 49            Tennessee       3             0.1580
## 50              Alabama       3             0.1550
## 51       South Carolina       3             0.1030
# Visualizing the states in the cluster with the highest export percentage value and the highest percentage of small business exporters

df %>% select("State", "cluster", "PercentExportValue") %>% filter(cluster == 4) %>% arrange(PercentExportValue)
##                  State cluster PercentExportValue
## 1 District of Columbia       4              0.795
print("I would propose cluster number 4, as it contains the states with the highest value of the PercentExportValue variable. The only state in the cluster, and therefore the only representative, is District of Columbia.")
## [1] "I would propose cluster number 4, as it contains the states with the highest value of the PercentExportValue variable. The only state in the cluster, and therefore the only representative, is District of Columbia."