The purpose of this analysis was to create clusters of U.S. states (including Washington, D.C.) that share common features and determine if the standard regional groupings (Midwest, New England, etc.) are truly reflective of states’ characteristics. Before feature selection, the data set contained 35 variables, including Population Density, Percent White, Percent Female, Effective Minimum Wage, 2020 Presidential Election Voter Turnout, Coronavirus Deaths per 100k Population, etc. All data was for the year 2020 only. Data was largely obtained from the 2020 Census and 2020 American Community Survey 1-Year Experimental Estimates, although supplemental sources were also used.
Latitude and Longitude were included as variables to account for certain regional similarities that cannot be quantified (accent, dialect, cuisine, etc.). Analysis results may have been different without the inclusion of Latitude and Longitude.
Before beginning the analysis, variables were assessed for multicollinearity. The table below shows variable pairings with a correlation greater than |0.75|.
# Libraries
library(readxl) #for reading the Excel file
library(tibble) #for filtering the correlation matrix
library(dplyr) #for filtering the correlation matrix
library(tidyr) #for filtering the correlation matrix
library(kohonen) #for SOM functions
library(usmap) #for state mapping
library(ggplot2) #for state mapping
library(kableExtra) #for tables
# Read Data
data <- read_excel('E:/Independent Projects/StateClustering/Master_Dataset.xlsx', sheet="Data")
# Create Correlations
corrs <- round(cor(data[,-4]),3)
# Remove the diagonal and repeat correlations
corrs[!lower.tri(corrs)]<-NA
# Look at the correlations over .75 or under -.75
high_corrs <-
data.frame(corrs) %>%
rownames_to_column() %>%
gather(key="variable", value="correlation", -rowname) %>%
filter(abs(correlation) > 0.75) %>%
arrange(desc(abs(correlation)))
# Create table of High Correlations
kable(high_corrs,
col.names=c("Variable 1", "Variable 2","Correlation"),
longtable = TRUE) %>%
kable_styling(bootstrap_options = c("striped")) %>%
row_spec(0,bold=TRUE)
| Variable 1 |
Variable 2 |
Correlation |
| Housing_Units |
Population |
0.994 |
| Bach_and_Above |
Mdn_Ind_Income |
0.873 |
| Perc_P_Islander |
Perc_Asian |
0.860 |
| Perc_Nvr_Married |
Pop_Density |
0.851 |
| Bach_and_Above |
Perc_Gen_Elect_Dem |
0.845 |
| Area_Water |
Area_Land |
0.833 |
| Perc_wo_Comp_Int |
Mdn_House_Income |
-0.799 |
| Perc_wo_Comp_Int |
Perc_Poverty |
0.798 |
| Bach_and_Above |
Mdn_House_Income |
0.778 |
| Mdn_Ind_Income |
Pop_Density |
0.777 |
| Perc_Nvr_Married |
Perc_Gen_Elect_Dem |
0.767 |
| Mdn_Ind_Income |
Mdn_House_Income |
0.754 |
| Perc_Poverty |
Mdn_House_Income |
-0.754 |
| Perc_Female |
Perc_Black |
0.752 |
Due to high correlations with other variables, Housing Units, Median Individual Income, and Percent Never Married were removed from the data set. Other slightly-less highly correlated variables were not removed. Self Organizing Maps do not require variable independence, so the removal or inclusion of correlated variables is more of an exercise in judgment (Is this variable truly different from that one? Am I measuring the same underlying data twice? Is that underlying data important enough to warrant being measured twice?), than a modelling necessity.
Next, the remaining variables were normalized, by subtracting their means and dividing by their standard deviations. This put all the variables on the same scale and removed any modelling impact from their original, highly disparate scales.
The data was then used to create a Self Organizing Map, using a 4x4 grid and 500 iterations. Training progress was plotted to determine that 500 iterations were sufficient.
# Remove Housing_Units for having a .99 correlation with Population
# Remove Mdn_Ind_Income for having a .87 correlation with Bach_and_Above, as well as a strong correlation w 4 other variables
# Remove Perc_Nvr_Married for having a .85 correlation with Pop_Density, as well as a strong correlation w 3 other variables
# Remove State, since it can't be used in the SOM
data2 <- data[,-c(4,7,29,31)]
# Normalize data by subtracting mean and dividing by s.d.
data2 <- scale(data2)
# Check the summary to verify that scaling took place
dat_summary<-summary(data2)
# SOM ---------------------------------------------------------
# Set a seed
set.seed(8)
# Create a 4x4 grid
som.grid <- somgrid(xdim = 4, ydim = 4, topo = "hexagonal")
# Create a model using our grid and 500 iterations
som.model <- som(data2,
grid = som.grid,
rlen=500,
alpha=c(0.05,0.01),
keep.data = TRUE )
# Check training progress
training_plot<-plot(som.model, type="changes") # Graph levels out, indicating that we don't need more iterations

The model’s node counts and neighbor distances were plotted, and the Within Cluster Sum of Squares were calculated for up to 15 clusters. Node counts and neighbor distances were low, but the problem largely wasn’t mitigated by using a smaller grid. Smaller grids tended to continue to have small distances between neighbors; node counts increased for one or two nodes, but most nodes remained sparse. Therefore, the 4x4 grid was maintained.
# Check node counts
plot(som.model, type="count", main="Node Counts")

# Check each cell's distance from its neighbors
plot(som.model, type="dist.neighbours", main = "SOM neighbour distances")

The plot of the Within Cluster Sum of Squares (below) elbowed at around 5 clusters, but continued to decrease steadily until around 11 clusters. 8 to 11 clusters were tested and mapped. Ultimately, 11 clusters were chosen because Hawaii, Alaska, and Washington, D.C. were consistently placed into single-state clusters by themselves, leaving less clusters for the other 48 states.
# Get the within cluster sum of squares
wss <- (nrow(getCodes(som.model))-1)*sum(apply(getCodes(som.model),2,var))
# Plot the within cluster sum of squares for up to 15 clusters
for (i in 2:15) {
wss[i] <- sum(kmeans(getCodes(som.model), centers=i)$withinss)
}
plot(wss, main = "Within Cluster Sum of Squares by # of Clusters")

The clustered Self Organizing Map grid was produced for the 11 cluster model.
# Look at each variable's heat map by changing k
#k=32
#par(mfrow=c(4,8))
#for (k in 1:32){
#plot(som.model, type = "property", property = getCodes(som.model)[,k],
# main=colnames(getCodes(som.model))[k])
#}
# Use hierarchical clustering to group the vectors into 11 clusters
som.cluster <- cutree(hclust(dist(getCodes(som.model))), 11)
# Plot the clusters - this view uses 2 colors twice
#plot(som.model, type="mapping",
# bgcol = som.cluster,
# main = "Clusters")
# Plot the clusters - this view uses custom colors
plot(som.model, type="mapping",
bgcol = c("aquamarine","beige","cornflowerblue","darkseagreen","lavender","mistyrose2","cornflowerblue","slategray2","cadetblue1","mistyrose2","wheat3","lightpink","gray71","gray71","wheat3","wheat3"),
main = "Clusters")
add.cluster.boundaries(som.model, som.cluster)

The final mapping of the 11 clusters is below. Washington, D.C. is Cluster 2, although it cannot be seen on the national map.
# SOM Mapping ---------------------------------------------------------
# Create a vector of the states' assigned clusters
cluster.assignment <- som.cluster[som.model$unit.classif]
# Add assigned clusters to the dataset
data$cluster <- cluster.assignment
# Create map data
map_data <- data[,c(4,37)]
map_data$fips <- fips(map_data$State)
map_data <- map_data[,c(3,2)]
map_data$cluster <- as.factor(map_data$cluster)
# Create the map
plot_usmap(regions="states", data = map_data, values = "cluster", color = "black") +
scale_fill_discrete(name = "Cluster") +
theme(legend.position = "right")

The unique features of the clusters were then assessed using box plots (not shown; see code below).
# Assess clusters ---------------------------------------------------------
# Put data in a data frame, remove variables that weren't used for analysis
data<-as.data.frame(data[,-c(7,29,31)])
# Get cluster counts
data %>%
count(cluster)
# Create list of states by cluster
state_list <-
data %>%
group_by(cluster) %>%
count(State)
# Create boxplots for each variable
for (n in c(1:3,5:33)){
boxplot(data[,n]~data$cluster,
ylab=colnames(data)[n],
main=colnames(data)[n])
}
The final clusters were:
Cluster 1 - Hawaii
Hawaii has the highest water to land area ratio, as well as the highest mixed race and Asian populations, low incarceration and poverty, and a high percentage of residents with a High School education or above.
Cluster 2 - Washington DC
D.C. has extremely high population density and percent of residents with a Bachelors degree or above, as well as a high minimum wage, high unemployment, and the largest % female population.
Cluster 3 - Connecticut, Delaware, Maryland, Massachusetts, New Jersey, Rhode Island, Virginia
These states are higher income and have a high percent of residents with a Bachelors degree or above. Their population densities are above average and they experienced above average Coronavirus deaths per 100,000 population in 2020.
Cluster 4 - Maine, New Hampshire, Vermont
These three states are distinct from the rest of New England. They are small states with small, largely white populations. They have the highest percent vacant housing of any cluster, but the lowest percent of residents who considered themselves very religious.
Cluster 5 - Arizona, California, Nevada, New Mexico, Texas
These large, South-Western states have above average areas and the highest amounts of Hispanic residents. These states have been growing in population, and this cluster has the lowest percent of residents with a High School education or above.
Cluster 6 - Illinois, Michigan, New York, Ohio, Pennsylvania
This cluster is extremely average in almost every measure, other than population change since 2010. These states have grown more slowly than average over the past 10 years.
Cluster 7 - Colorado, Oregon, Washington
These three Western states have above average Native and Pacific Islander populations and high minimum wages. They have been growing rapidly and have high voter turnout.
Cluster 8 - Florida, Georgia, North Carolina
These three Southern states have a higher than average percent black residents. They also have low minimum wages and low median household incomes. An above average amount of their residents report being very religious.
Cluster 9 - Idaho, Indiana, Iowa, Kansas, Minnesota, Missouri, Montana, Nebraska, North Dakota, South Dakota, Utah, Wisconsin, Wyoming
This cluster had the lowest minimum wages, as well as high positive COVID tests per 100,000. A large percent of their populations has a High School education or above. They are predominantly white, with above average Native populations.
Cluster 10 - Alaska
Alaska has the highest land area and a small population. It experienced high unemployment in 2020 and has the lowest percent female population of any state. It also has a large amount of vacant housing and a large Pacific Islander population.
Cluster 11 - Alabama, Arkansas, Kentucky, Louisiana, Mississippi, Oklahoma, South Carolina, Tennessee, West Virginia
Cluster 11 has the highest percent of its population incarcerated. It has low minimum wages and the lowest percent of population with a Bachelors degree and above. It is the cluster with the greatest proportion of individuals reporting themselves as very religious, as well as the greatest proportion of individuals without access to an internet-connected computer.
---
title: "U.S. State Clustering"
output:
  html_notebook:
      code_folding: hide
---

The purpose of this analysis was to create clusters of U.S. states (including Washington, D.C.) that share common features and determine if the standard regional groupings (Midwest, New England, etc.) are truly reflective of states' characteristics. Before feature selection, the data set contained 35 variables, including Population Density, Percent White, Percent Female, Effective Minimum Wage, 2020 Presidential Election Voter Turnout, Coronavirus Deaths per 100k Population, etc. All data was for the year 2020 only. Data was largely obtained from the 2020 Census and 2020 American Community Survey 1-Year Experimental Estimates, although supplemental sources were also used.

Latitude and Longitude were included as variables to account for certain regional similarities that cannot be quantified (accent, dialect, cuisine, etc.). Analysis results may have been different without the inclusion of Latitude and Longitude.

Before beginning the analysis, variables were assessed for multicollinearity. The table below shows variable pairings with a correlation greater than |0.75|.

```{r include=TRUE}
# Libraries
library(readxl)  #for reading the Excel file
library(tibble)  #for filtering the correlation matrix
library(dplyr)   #for filtering the correlation matrix
library(tidyr)   #for filtering the correlation matrix
library(kohonen) #for SOM functions
library(usmap)   #for state mapping
library(ggplot2) #for state mapping
library(kableExtra) #for tables


# Read Data
data <- read_excel('E:/Independent Projects/StateClustering/Master_Dataset.xlsx', sheet="Data")


# Create Correlations
corrs <- round(cor(data[,-4]),3)


# Remove the diagonal and repeat correlations
corrs[!lower.tri(corrs)]<-NA 


# Look at the correlations over .75 or under -.75
  high_corrs <-
  data.frame(corrs) %>%
  rownames_to_column() %>%
  gather(key="variable", value="correlation", -rowname) %>%
  filter(abs(correlation) > 0.75) %>%
  arrange(desc(abs(correlation)))
  
  
# Create table of High Correlations 
  kable(high_corrs, 
      col.names=c("Variable 1", "Variable 2","Correlation"), 
      longtable = TRUE) %>%
  kable_styling(bootstrap_options = c("striped")) %>% 
        row_spec(0,bold=TRUE) 
```


Due to high correlations with other variables, Housing Units, Median Individual Income, and Percent Never Married were removed from the data set. Other slightly-less highly correlated variables were not removed. Self Organizing Maps do not require variable independence, so the removal or inclusion of correlated variables is more of an exercise in judgment (Is this variable truly different from that one? Am I measuring the same underlying data twice? Is that underlying data important enough to warrant being measured twice?), than a modelling necessity.

Next, the remaining variables were normalized, by subtracting their means and dividing by their standard deviations. This put all the variables on the same scale and removed any modelling impact from their original, highly disparate scales.

The data was then used to create a Self Organizing Map, using a 4x4 grid and 500 iterations. Training progress was plotted to determine that 500 iterations were sufficient.


```{r}

# Remove Housing_Units for having a .99 correlation with Population
# Remove Mdn_Ind_Income for having a .87 correlation with Bach_and_Above, as well as a strong correlation w 4 other variables
# Remove Perc_Nvr_Married for having a .85 correlation with Pop_Density, as well as a strong correlation w 3 other variables
# Remove State, since it can't be used in the SOM
data2 <- data[,-c(4,7,29,31)]


# Normalize data by subtracting mean and dividing by s.d.
data2 <- scale(data2)


# Check the summary to verify that scaling took place
dat_summary<-summary(data2)



# SOM ---------------------------------------------------------


# Set a seed
set.seed(8)


# Create a 4x4 grid
som.grid <- somgrid(xdim = 4, ydim = 4, topo = "hexagonal")


# Create a model using our grid and 500 iterations
som.model <- som(data2, 
                 grid = som.grid, 
                 rlen=500, 
                 alpha=c(0.05,0.01), 
                 keep.data = TRUE )


# Check training progress
training_plot<-plot(som.model, type="changes")  # Graph levels out, indicating that we don't need more iterations
```


The model's node counts and neighbor distances were plotted, and the Within Cluster Sum of Squares were calculated for up to 15 clusters. Node counts and neighbor distances were low, but the problem largely wasn't mitigated by using a smaller grid. Smaller grids tended to continue to have small distances between neighbors; node counts increased for one or two nodes, but most nodes remained sparse. Therefore, the 4x4 grid was maintained. 



```{r}
# Check node counts
plot(som.model, type="count", main="Node Counts")
```

```{r}
# Check each cell's distance from its neighbors
plot(som.model, type="dist.neighbours", main = "SOM neighbour distances")
```

The plot of the Within Cluster Sum of Squares (below) elbowed at around 5 clusters, but continued to decrease steadily until around 11 clusters. 8 to 11 clusters were tested and mapped. Ultimately, 11 clusters were chosen because Hawaii, Alaska, and Washington, D.C. were consistently placed into single-state clusters by themselves, leaving less clusters for the other 48 states.

```{r}
# Get the within cluster sum of squares 
wss <- (nrow(getCodes(som.model))-1)*sum(apply(getCodes(som.model),2,var)) 

# Plot the within cluster sum of squares for up to 15 clusters
for (i in 2:15) {
  wss[i] <- sum(kmeans(getCodes(som.model), centers=i)$withinss)
}
plot(wss, main = "Within Cluster Sum of Squares by # of Clusters")
```


The clustered Self Organizing Map grid was produced for the 11 cluster model. 
```{r}
# Look at each variable's heat map by changing k
#k=32
#par(mfrow=c(4,8))
#for (k in 1:32){
#plot(som.model, type = "property", property = getCodes(som.model)[,k], 
#     main=colnames(getCodes(som.model))[k])
#}

# Use hierarchical clustering to group the vectors into 11 clusters
som.cluster <- cutree(hclust(dist(getCodes(som.model))), 11)


# Plot the clusters - this view uses 2 colors twice
#plot(som.model, type="mapping", 
#     bgcol = som.cluster,
#     main = "Clusters") 


# Plot the clusters - this view uses custom colors
plot(som.model, type="mapping", 
     bgcol = c("aquamarine","beige","cornflowerblue","darkseagreen","lavender","mistyrose2","cornflowerblue","slategray2","cadetblue1","mistyrose2","wheat3","lightpink","gray71","gray71","wheat3","wheat3"),
     main = "Clusters") 
add.cluster.boundaries(som.model, som.cluster)
```


The final mapping of the 11 clusters is below. Washington, D.C. is Cluster 2, although it cannot be seen on the national map.

```{r}
# SOM Mapping ---------------------------------------------------------


# Create a vector of the states' assigned clusters
cluster.assignment <- som.cluster[som.model$unit.classif]


# Add assigned clusters to the dataset
data$cluster <- cluster.assignment


# Create map data
map_data <- data[,c(4,37)]
map_data$fips <- fips(map_data$State)
map_data <- map_data[,c(3,2)]
map_data$cluster <- as.factor(map_data$cluster)


# Create the map
plot_usmap(regions="states", data = map_data, values = "cluster", color = "black") + 
 scale_fill_discrete(name = "Cluster") + 
    theme(legend.position = "right") 
```


The unique features of the clusters were then assessed using box plots (not shown; see code below).
```{r}
# Assess clusters ---------------------------------------------------------


# Put data in a data frame, remove variables that weren't used for analysis
data<-as.data.frame(data[,-c(7,29,31)])


# Get cluster counts
data %>% 
  count(cluster)


# Create list of states by cluster
state_list <-
   data %>%
   group_by(cluster) %>%
   count(State)


# Create boxplots for each variable
for (n in c(1:3,5:33)){
    boxplot(data[,n]~data$cluster, 
          ylab=colnames(data)[n],
          main=colnames(data)[n])
}
```

The final clusters were:


<b>Cluster 1 - Hawaii</b>

 Hawaii has the highest water to land area ratio, as well as the highest mixed race and Asian populations, low incarceration and poverty, and a high percentage of residents with a High School education or above.

<br/>

<b>Cluster 2 - Washington DC</b>

 D.C. has extremely high population density and percent of residents with a Bachelors degree or above, as well as a high minimum wage, high unemployment, and the largest % female population.

<br/>

<b>Cluster 3 - Connecticut, Delaware, Maryland, Massachusetts, New Jersey, Rhode Island, Virginia</b>

 These states are higher income and have a high percent of residents with a Bachelors degree or above. Their population densities are above average and they experienced above average Coronavirus deaths per 100,000 population in 2020.

<br/>

<b>Cluster 4 - Maine, New Hampshire, Vermont</b>

 These three states are distinct from the rest of New England. They are small states with small, largely white populations. They have the highest percent vacant housing of any cluster, but the lowest percent of residents who considered themselves very religious.

<br/>

<b>Cluster 5 - Arizona, California, Nevada, New Mexico, Texas</b>

 These large, South-Western states have above average areas and the highest amounts of Hispanic residents. These states have been growing in population, and this cluster has the lowest percent of residents with a High School education or above.

<br/>

<b>Cluster 6 - Illinois, Michigan, New York, Ohio, Pennsylvania</b>

 This cluster is extremely average in almost every measure, other than population change since 2010. These states have grown more slowly than average over the past 10 years. 

<br/>

<b>Cluster 7 - Colorado, Oregon, Washington</b>

 These three Western states have above average Native and Pacific Islander populations and high minimum wages. They have been growing rapidly and have high voter turnout. 

<br/>

<b>Cluster 8 - Florida, Georgia, North Carolina</b>

 These three Southern states have a higher than average percent black residents. They also have low minimum wages and low median household incomes. An above average amount of their residents report being very religious.

<br/>

<b>Cluster 9 - Idaho, Indiana, Iowa, Kansas, Minnesota, Missouri, Montana, Nebraska, North Dakota, South Dakota, Utah, Wisconsin, Wyoming</b>

 This cluster had the lowest minimum wages, as well as high positive COVID tests per 100,000. A large percent of their populations has a High School education or above. They are predominantly white, with above average Native populations.

<br/>

<b>Cluster 10 - Alaska</b>

 Alaska has the highest land area and a small population. It experienced high unemployment in 2020 and has the lowest percent female population of any state. It also has a large amount of vacant housing and a large Pacific Islander population.

<br/>

<b>Cluster 11 - Alabama, Arkansas, Kentucky, Louisiana, Mississippi, Oklahoma, South Carolina, Tennessee, West Virginia</b>

 Cluster 11 has the highest percent of its population incarcerated. It has low minimum wages and the lowest percent of population with a Bachelors degree and above. It is the cluster with the greatest proportion of individuals reporting themselves as very religious, as well as the greatest proportion of individuals without access to an internet-connected computer.

<br/>



