knitr::opts_chunk$set(echo = TRUE)
This exercise focuses on the businesses that are struggling the most during this crisis, the small businesses. Small companies represent 99% of all businesses in the United States and nearly 50% of employment, making them an essential engine for economic development.
Every year, the U.S. Small Business Administration (SBA) publishes an annual report on small business activity in each state. For this exercise, we will use part of the data published by this public organization and attempt to draw conclusions using Clustering.
You can find the dataset and the data dictionary in the data folder.
First, let’s load the dataset into R:
# Load required packages
require(factoextra)
library(tidyverse)
library(dplyr)
# Load the dataset
df <- read.csv("/Users/arnauandrews/Desktop/UB-Master/M4 - Técnicas Avanzadas de Minería de Datos/Data/4.4_Cluster_ejercicio.csv", dec=".", sep=",", header=TRUE)
colnames(df)
## [1] "State" "PercentSmallBiz"
## [3] "PercentEmplySmallBiz" "AvgEmPerSmallBiz"
## [5] "MedIncomeSelfEmpCorp" "MedIncomeSelfEmpUnincorp"
## [7] "PercentExportSmallBiz" "PercentExportValue"
## [9] "StateGDPGrowth" "StateUnemployRate"
## [11] "ShareManufactEmploy" "AvgEmpPerWOB"
## [13] "AvgEmpPerVOB" "AvgEmpPerMOB"
## [15] "ShareProSvcEmploy" "GrowthNumPropiertors"
head(df)
## State PercentSmallBiz PercentEmplySmallBiz AvgEmPerSmallBiz
## 1 Alabama 0.9936905 0.4746900 2.0
## 2 Alaska 0.9912795 0.5315244 2.0
## 3 Arizona 0.9942976 0.4449635 1.8
## 4 Arkansas 0.9927139 0.4782382 1.9
## 5 California 0.9984648 0.4883776 1.8
## 6 Colorado 0.9946615 0.4861072 1.8
## MedIncomeSelfEmpCorp MedIncomeSelfEmpUnincorp PercentExportSmallBiz
## 1 48941 20760 0.804
## 2 60559 30817 0.710
## 3 48698 20872 0.874
## 4 45044 21128 0.781
## 5 57420 25034 0.957
## 6 49977 24631 0.875
## PercentExportValue StateGDPGrowth StateUnemployRate ShareManufactEmploy
## 1 0.155 0.03 0.04 0.34
## 2 0.384 0.02 0.07 0.35
## 3 0.231 0.03 0.05 0.46
## 4 0.283 0.02 0.04 0.26
## 5 0.432 0.03 0.04 0.53
## 6 0.306 0.04 0.03 0.52
## AvgEmpPerWOB AvgEmpPerVOB AvgEmpPerMOB ShareProSvcEmploy GrowthNumPropiertors
## 1 11.8 10.7 9.5 0.62 0.02
## 2 5.6 6.5 6.4 0.59 0.01
## 3 8.3 11.5 9.4 0.56 0.03
## 4 8.2 7.8 7.7 0.71 0.02
## 5 7.9 10.1 8.5 0.55 0.03
## 6 6.1 8.6 7.6 0.57 0.02
As you can see, the dataset has 51 observations (corresponding to each of the federal states in the USA) and 16 variables (related to economic data for small businesses). If you need more information about the meaning of each variable, you can find it in the data dictionary.
You will notice that certain variables contain asterisks, which represent missing values (NA). Therefore, we need to handle these variables to apply clustering.
# Replace "*" with NA
df[df == "*"] <- NA
# Display columns with NA
colSums(is.na(df))
## State PercentSmallBiz PercentEmplySmallBiz
## 0 0 0
## AvgEmPerSmallBiz MedIncomeSelfEmpCorp MedIncomeSelfEmpUnincorp
## 0 0 0
## PercentExportSmallBiz PercentExportValue StateGDPGrowth
## 0 0 0
## StateUnemployRate ShareManufactEmploy AvgEmpPerWOB
## 0 0 4
## AvgEmpPerVOB AvgEmpPerMOB ShareProSvcEmploy
## 6 1 1
## GrowthNumPropiertors
## 0
# Convert variables to numeric
df$AvgEmpPerWOB <- as.numeric(df$AvgEmpPerWOB)
df$AvgEmpPerVOB <- as.numeric(df$AvgEmpPerVOB)
df$AvgEmpPerMOB <- as.numeric(df$AvgEmpPerMOB)
df$ShareProSvcEmploy <- as.numeric(df$ShareProSvcEmploy)
# Replace missing values with median
df <- df %>% mutate(across(c(AvgEmpPerWOB, AvgEmpPerVOB, AvgEmpPerMOB, ShareProSvcEmploy), ~replace_na(., median(., na.rm=TRUE))))
head(df)
## State PercentSmallBiz PercentEmplySmallBiz AvgEmPerSmallBiz
## 1 Alabama 0.9936905 0.4746900 2.0
## 2 Alaska 0.9912795 0.5315244 2.0
## 3 Arizona 0.9942976 0.4449635 1.8
## 4 Arkansas 0.9927139 0.4782382 1.9
## 5 California 0.9984648 0.4883776 1.8
## 6 Colorado 0.9946615 0.4861072 1.8
## MedIncomeSelfEmpCorp MedIncomeSelfEmpUnincorp PercentExportSmallBiz
## 1 48941 20760 0.804
## 2 60559 30817 0.710
## 3 48698 20872 0.874
## 4 45044 21128 0.781
## 5 57420 25034 0.957
## 6 49977 24631 0.875
## PercentExportValue StateGDPGrowth StateUnemployRate ShareManufactEmploy
## 1 0.155 0.03 0.04 0.34
## 2 0.384 0.02 0.07 0.35
## 3 0.231 0.03 0.05 0.46
## 4 0.283 0.02 0.04 0.26
## 5 0.432 0.03 0.04 0.53
## 6 0.306 0.04 0.03 0.52
## AvgEmpPerWOB AvgEmpPerVOB AvgEmpPerMOB ShareProSvcEmploy GrowthNumPropiertors
## 1 11.8 10.7 9.5 0.62 0.02
## 2 5.6 6.5 6.4 0.59 0.01
## 3 8.3 11.5 9.4 0.56 0.03
## 4 8.2 7.8 7.7 0.71 0.02
## 5 7.9 10.1 8.5 0.55 0.03
## 6 6.1 8.6 7.6 0.57 0.02
Once the missing value imputation is done, it is necessary to observe the range of variables that will be used for clustering and determine whether they need to be transformed or not.
# Add the "States" variable as the row name in the new dataframe
df_copy <- as.data.frame(df)
rownames(df_copy) <- df_copy$State
df_copy <- df_copy[,-1]
# Scale the variables by creating a new dataframe
df_copy <- scale(df_copy)
# Summarize the scaled variables
summary(df_copy)
## PercentSmallBiz PercentEmplySmallBiz AvgEmPerSmallBiz MedIncomeSelfEmpCorp
## Min. :-3.3817 Min. :-1.4818 Min. :-2.45769 Min. :-1.67544
## 1st Qu.:-0.4950 1st Qu.:-0.6590 1st Qu.:-0.65421 1st Qu.:-0.67866
## Median : 0.1872 Median :-0.1937 Median :-0.05304 Median :-0.03815
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.6152 3rd Qu.: 0.4474 3rd Qu.: 0.54812 3rd Qu.: 0.39076
## Max. : 1.6142 Max. : 3.0440 Max. : 3.55392 Max. : 2.99864
## MedIncomeSelfEmpUnincorp PercentExportSmallBiz PercentExportValue
## Min. :-1.3557 Min. :-2.59676 Min. :-1.5062
## 1st Qu.:-0.6866 1st Qu.:-0.41868 1st Qu.:-0.6043
## Median :-0.3756 Median : 0.03196 Median :-0.3036
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.3869 3rd Qu.: 0.63281 3rd Qu.: 0.3073
## Max. : 3.1839 Max. : 2.04105 Max. : 2.9678
## StateGDPGrowth StateUnemployRate ShareManufactEmploy AvgEmpPerWOB
## Min. :-1.96232 Min. :-2.1410 Min. :-1.692200 Min. :-1.8917
## 1st Qu.:-0.96154 1st Qu.:-1.1483 1st Qu.:-0.842769 1st Qu.:-0.4773
## Median : 0.03925 Median :-0.1557 Median : 0.006662 Median : 0.1182
## Mean : 0.00000 Mean : 0.0000 Mean : 0.000000 Mean : 0.0000
## 3rd Qu.: 1.04003 3rd Qu.: 0.8369 3rd Qu.: 0.516321 3rd Qu.: 0.4160
## Max. : 3.04160 Max. : 2.8222 Max. : 2.809785 Max. : 2.7237
## AvgEmpPerVOB AvgEmpPerMOB ShareProSvcEmploy GrowthNumPropiertors
## Min. :-1.39697 Min. :-2.01711 Min. :-3.40937 Min. :-2.4626
## 1st Qu.:-0.72689 1st Qu.:-0.78221 1st Qu.:-0.68417 1st Qu.:-0.3694
## Median :-0.09304 Median :-0.03053 Median :-0.05153 Median :-0.3694
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.37783 3rd Qu.: 0.58692 3rd Qu.: 0.53245 3rd Qu.:-0.3694
## Max. : 4.86914 Max. : 2.06343 Max. : 2.72235 Max. : 1.7238
Now that we have prepared the dataframe, we need to estimate the optimal number of clusters. Although it can be a subjective process, there are some methods that can help us make the decision.
# Elbow method visualization
fviz_nbclust(x = df_copy, FUNcluster = kmeans, method = "wss", k.max = 15,
diss = get_dist(df_copy, method = "euclidean"), nstart = 50)
print("According to the elbow method, I believe the optimal number of clusters is 4. However, it can be observed that the curve does not have a well-defined elbow point, where the relationship between the number of clusters and the sum of squared distances between data points and their assigned centroids starts to level off.")
## [1] "According to the elbow method, I believe the optimal number of clusters is 4. However, it can be observed that the curve does not have a well-defined elbow point, where the relationship between the number of clusters and the sum of squared distances between data points and their assigned centroids starts to level off."
print("For this reason, I think values between 4 and 8 clusters could be quite optimal.")
## [1] "For this reason, I think values between 4 and 8 clusters could be quite optimal."
# Dendrogram visualization
set.seed(12345)
hc_complete <- hclust(d = dist(x = df_copy, method = "euclidean"),
method = "complete")
fviz_dend(x = hc_complete, cex = 0.5, main = "Complete Linkage",
sub = "Euclidean Distance") +
theme(plot.title = element_text(hjust = 0.5, size = 15))
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
print("According to the dendrogram, I observe that the optimal number of clusters is at least 3, as the District of Columbia is clearly separated from the other two major subgroups and has the longest branch, indicating that it differs significantly from the rest of the states.")
## [1] "According to the dendrogram, I observe that the optimal number of clusters is at least 3, as the District of Columbia is clearly separated from the other two major subgroups and has the longest branch, indicating that it differs significantly from the rest of the states."
print("However, I consider that another optimal number of clusters according to the dendrogram (the ideal point where the clusters explain the underlying nature and patterns of the data) is between 4 and 8.")
## [1] "However, I consider that another optimal number of clusters according to the dendrogram (the ideal point where the clusters explain the underlying nature and patterns of the data) is between 4 and 8."
Once the decision on the number of clusters is made, we proceed to create and visualize the clusters in two dimensions.
# Creacion de los clusters
set.seed(12345)
km_clusters <- kmeans(x = df_copy, centers = 4, nstart = 50)
# Visualización de los clusters en 2 dimensiones
fviz_cluster(object = km_clusters, data = df_copy, show.clust.cent = TRUE,
elipse.type = "euclid", star.plot = TRUE, repel = TRUE) +
labs(title = "Resultados clustering k-means") +
theme_bw() +
theme(legend.position = "none")
Now that we have the two-dimensional visualization of the states belonging to each created cluster, let’s move on to the interesting part, the interpretation.
print(“Illinois, Maine, South Carolina, and District of Columbia are the most representative states of each cluster because they are the closest to each of the respective centroids of the four clusters.”) Choosing which state you consider most representative of each cluster.
# Calculating the distance of each state to its centroid
distances <- sqrt(rowSums((df_copy - fitted(km_clusters))^ 2))
distance <- as.data.frame(distances)
# Creating a classification of each state with its cluster
cluster_cat <- as.data.frame(km_clusters[1])
# Creating the state variable in the dataframe with states grouped by cluster and distances
cluster_cat <- rownames_to_column(cluster_cat, var = "State")
distance <- rownames_to_column(distance, var = "State")
# Merging the two dataframes
distance <- inner_join(distance, cluster_cat, by = "State")
# Checking the state of each cluster with the lowest distance to its centroid
distance %>% select("State", "distances", "cluster") %>% arrange(distances)
## State distances cluster
## 1 District of Columbia 0.000000 4
## 2 Illinois 1.615089 1
## 3 South Carolina 1.677608 3
## 4 Kansas 1.831775 3
## 5 Virginia 1.904677 3
## 6 Maryland 2.034010 1
## 7 Oklahoma 2.073959 3
## 8 Indiana 2.075422 3
## 9 Iowa 2.172062 3
## 10 Kentucky 2.360737 3
## 11 Colorado 2.379225 1
## 12 North Carolina 2.414516 3
## 13 Pennsylvania 2.416780 1
## 14 Maine 2.430999 2
## 15 Arizona 2.432256 1
## 16 Washington 2.476953 1
## 17 Missouri 2.515505 1
## 18 Alabama 2.516117 3
## 19 Tennessee 2.557161 3
## 20 Minnesota 2.574178 3
## 21 West Virginia 2.610625 3
## 22 California 2.625364 1
## 23 Ohio 2.647712 3
## 24 Arkansas 2.650440 3
## 25 Wisconsin 2.758175 3
## 26 Oregon 2.785385 1
## 27 Michigan 2.787209 3
## 28 Massachusetts 2.816631 1
## 29 Idaho 2.827351 2
## 30 Utah 2.834480 1
## 31 New Hampshire 2.850941 2
## 32 Georgia 2.864210 1
## 33 Louisiana 2.971267 3
## 34 New York 2.991384 1
## 35 Vermont 3.136421 2
## 36 Texas 3.175450 1
## 37 Mississippi 3.230139 3
## 38 Connecticut 3.284822 1
## 39 Wyoming 3.407876 2
## 40 New Mexico 3.484061 3
## 41 Rhode Island 3.495238 2
## 42 New Jersey 3.516406 1
## 43 Montana 3.641645 2
## 44 South Dakota 3.886305 2
## 45 North Dakota 3.996394 2
## 46 Nebraska 4.571937 3
## 47 Florida 4.767534 1
## 48 Delaware 4.956830 1
## 49 Hawaii 4.996611 2
## 50 Nevada 5.204733 3
## 51 Alaska 5.692296 2
print("Illinois, Maine, South Carolina, and District of Columbia are the most representative states of each cluster because they are the closest to each of the respective centroids of the four clusters.")
## [1] "Illinois, Maine, South Carolina, and District of Columbia are the most representative states of each cluster because they are the closest to each of the respective centroids of the four clusters."
# Selecting the states to be included in a vector
states <- c("Illinois", "District of Columbia", "South Carolina", "Maine")
# Creating a new dataframe from the initial dataframe with only the selected states
df_select <- subset(df, rownames(df_copy) %in% states)
df_select_stand <- subset(df_copy, rownames(df_copy) %in% states)
df_select_stand <- as.data.frame(df_select_stand)
# Creating a new dataframe with the difference between the maximum and minimum value of each variable
df_diff <- data.frame(apply(df_select_stand, 2, function(x) max(x) - min(x)))
df_diff %>% arrange(desc(apply.df_select_stand..2..function.x..max.x....min.x..))
## apply.df_select_stand..2..function.x..max.x....min.x..
## PercentExportValue 4.4739257
## PercentSmallBiz 4.3093460
## MedIncomeSelfEmpCorp 4.2508722
## AvgEmPerSmallBiz 4.2081271
## ShareManufactEmploy 3.7374973
## MedIncomeSelfEmpUnincorp 3.4718593
## ShareProSvcEmploy 3.0171961
## StateUnemployRate 2.9778938
## PercentExportSmallBiz 2.4221785
## PercentEmplySmallBiz 2.2155189
## GrowthNumPropiertors 2.0931897
## StateGDPGrowth 2.0015705
## AvgEmpPerMOB 1.8791958
## AvgEmpPerWOB 1.4144210
## AvgEmpPerVOB 0.8330653
# Answer
print("The variables that differ the most between states are: PercentExportValue, PercentSmallBiz, MedIncomeSelfEmpCorp, and AvgEmPerSmallBiz.")
## [1] "The variables that differ the most between states are: PercentExportValue, PercentSmallBiz, MedIncomeSelfEmpCorp, and AvgEmPerSmallBiz."
# Viewing the StateUnemployRate variable for the selected states
df_select %>% select(State, StateUnemployRate)
## State StateUnemployRate
## 9 District of Columbia 0.06
## 14 Illinois 0.05
## 20 Maine 0.03
## 41 South Carolina 0.04
# Answer here
print("The state with the highest unemployment rate is District of Columbia.")
## [1] "The state with the highest unemployment rate is District of Columbia."
# Crear dataframes con el cluster al que pertenece cada estado y la media de cada variable por # Create dataframes with the cluster to which each state belongs and the mean of each variable by cluster
clusters_mean <- as.data.frame(km_clusters[2])
# Add the cluster variable to the original dataframe
df <- inner_join(df, cluster_cat, by = "State")
# To identify the states that export the least and with lower value, identify the states with the lowest "PercentExportValue" which indicates the percentage of export value by small businesses.
df %>% select("State", "cluster", "PercentExportValue") %>% filter(cluster == 3) %>% arrange(PercentExportValue)
## State cluster PercentExportValue
## 1 South Carolina 3 0.1030
## 2 Alabama 3 0.1550
## 3 Tennessee 3 0.1580
## 4 Mississippi 3 0.1630
## 5 Nevada 3 0.1700
## 6 Indiana 3 0.1740
## 7 Iowa 3 0.1890
## 8 Kansas 3 0.2210
## 9 Ohio 3 0.2240
## 10 Michigan 3 0.2390
## 11 Oklahoma 3 0.2460
## 12 West Virginia 3 0.2540
## 13 Nebraska 3 0.2550
## 14 North Carolina 3 0.2610
## 15 Virginia 3 0.2660
## 16 Wisconsin 3 0.2741
## 17 Arkansas 3 0.2830
## 18 Minnesota 3 0.2940
## 19 Louisiana 3 0.3270
## 20 Kentucky 3 0.3390
## 21 New Mexico 3 0.4320
print("I would propose cluster number 3, as it contains the states with the lowest value of the PercentExportValue variable. In this regard, the state I consider most representative is South Carolina, as it has the lowest PercentExportValue in cluster number 3.")
## [1] "I would propose cluster number 3, as it contains the states with the lowest value of the PercentExportValue variable. In this regard, the state I consider most representative is South Carolina, as it has the lowest PercentExportValue in cluster number 3."
# To identify the clusters with states that best represent small businesses with higher income, identify the states with the highest "PercentExportValue" which indicates the percentage of export value by small businesses.
df %>% select(State, cluster, PercentExportValue) %>% arrange(desc(PercentExportValue))
## State cluster PercentExportValue
## 1 District of Columbia 4 0.7950
## 2 Hawaii 2 0.6940
## 3 Montana 2 0.6560
## 4 Wyoming 2 0.6301
## 5 Rhode Island 2 0.6160
## 6 Florida 1 0.5890
## 7 New York 1 0.5700
## 8 Maine 2 0.5120
## 9 Utah 1 0.4890
## 10 California 1 0.4320
## 11 New Mexico 3 0.4320
## 12 New Jersey 1 0.4310
## 13 Alaska 2 0.3840
## 14 New Hampshire 2 0.3830
## 15 Texas 1 0.3760
## 16 South Dakota 2 0.3620
## 17 Vermont 2 0.3490
## 18 Massachusetts 1 0.3470
## 19 North Dakota 2 0.3470
## 20 Kentucky 3 0.3390
## 21 Louisiana 3 0.3270
## 22 Pennsylvania 1 0.3100
## 23 Colorado 1 0.3060
## 24 Maryland 1 0.2960
## 25 Minnesota 3 0.2940
## 26 Georgia 1 0.2890
## 27 Arkansas 3 0.2830
## 28 Wisconsin 3 0.2741
## 29 Oregon 1 0.2700
## 30 Idaho 2 0.2680
## 31 Missouri 1 0.2680
## 32 Virginia 3 0.2660
## 33 Connecticut 1 0.2640
## 34 North Carolina 3 0.2610
## 35 Illinois 1 0.2590
## 36 Nebraska 3 0.2550
## 37 West Virginia 3 0.2540
## 38 Oklahoma 3 0.2460
## 39 Michigan 3 0.2390
## 40 Arizona 1 0.2310
## 41 Ohio 3 0.2240
## 42 Kansas 3 0.2210
## 43 Washington 1 0.1930
## 44 Delaware 1 0.1910
## 45 Iowa 3 0.1890
## 46 Indiana 3 0.1740
## 47 Nevada 3 0.1700
## 48 Mississippi 3 0.1630
## 49 Tennessee 3 0.1580
## 50 Alabama 3 0.1550
## 51 South Carolina 3 0.1030
# Visualizing the states in the cluster with the highest export percentage value and the highest percentage of small business exporters
df %>% select("State", "cluster", "PercentExportValue") %>% filter(cluster == 4) %>% arrange(PercentExportValue)
## State cluster PercentExportValue
## 1 District of Columbia 4 0.795
print("I would propose cluster number 4, as it contains the states with the highest value of the PercentExportValue variable. The only state in the cluster, and therefore the only representative, is District of Columbia.")
## [1] "I would propose cluster number 4, as it contains the states with the highest value of the PercentExportValue variable. The only state in the cluster, and therefore the only representative, is District of Columbia."