In Lab 3, we were presented with the idea of clustering our neighborhoods into meaningful groups based on the demographic and socioeconomic characteristics of the residents. We used the provided clustering algorithm to group census tracts in Phoenix, Arizona, into clusters based on various demographic and socioeconomic variables. The goal was to identify patterns and trends in the data that could help us better understand the diversity and disparities within the city. In addition, we visualized the clustering results on a map to see how the different clusters are distributed geographically. In this Explainer Assignment, I will introduce you to another way that data can be clustered using the K-Means Clustering algorithm in R.
I had virtually no experience with K-Means Clustering before this lab, so I was excited to learn more about how it works and how it can be applied to real-world data. I found this assignment informative and engaging, and I was able to follow the instructions, run the code, and generate the clustering results for the Phoenix census tracts with minimal frustration, thanks to ChatGPT. Anytime I ran into coding difficulty, ChatGPT offered multiple solutions to try to correct my mistakes. Overall, I found this assignment to be a valuable learning experience, and I look forward to applying what I have learned to future projects.
I asked ChatGPT to help me write instructions on how to use the K-Means Clustering algorithm in R. Here is what it came up with. The code below is a combination of the code provided in the original Lab 3 and code generated by ChatGPT.
First, we need to load the following libraries. These should all be familiar from previous labs, with the exception of ‘factoextra’.

ChatGPT helped me write the comments explaining each package and identify the libraries required.
# Load necessary libraries
library(ggplot2) # For data visualization
library(dplyr) # For data manipulation
library(cluster) # For clustering algorithms
library(factoextra) # For visualizing clustering results
library( geojsonio ) # read shapefiles
library( sp ) # work with shapefiles
library( sf ) # work with shapefiles - simple features format
library( mclust ) # cluster analysis
library( tmap ) # theme maps
library( ggthemes ) # extra ggplot2 themes
library( pander ) # table formatting

This exercise uses Census data from the 2012 American Community Survey, made available through the Diversity and Disparities Project. We will be using the acs12 dataset, which contains information on various demographic and socioeconomic variables at the census tract level. Specifically, we will be loading a Phoenix Dorling shapefile and the acs12 data to perform a cluster analysis. This is the same dataset that we used for our Lab 3 clustering analysis.
# Load the Phoenix Census Tracts Dorling Cartogram
github.url <- "https://raw.githubusercontent.com/DS4PS/cpp-529-master/master/data/phx_dorling.geojson"
phx <- geojson_read( x=github.url, what="sp" )
plot( phx )Now, let’s see how we can bring this Pheonix cartogram to life, focusing on median household income and the tmap package. Again, this code was provided in our original lab 3.
# library( tmap )
phx <- spTransform( phx, CRS("+init=epsg:3395") )
bb <- st_bbox( c( xmin = -12519146, xmax = -12421368,
ymax = 3965924, ymin = 3899074 ),
crs = st_crs("+init=epsg:3395"))
tm_shape( phx, bbox=bb ) +
tm_polygons( col="MHHI", n=10, style="quantile", palette="Spectral" ) +
tm_layout( "PHX Dorling Cartogram", title.position=c("right","top") )Here is the provided data library so we can better understand which census variables we are considering in our cluster analysis:
| LABEL | VARIABLE |
|---|---|
| tractid | GEOID |
| pnhwht12 | Percent white, non-Hispanic |
| pnhblk12 | Percent black, non-Hispanic |
| phisp12 | Percent Hispanic |
| pntv12 | Percent Native American race |
| pfb12 | Percent foreign born |
| polang12 | Percent speaking other language at home, age 5 plus |
| phs12 | Percent with high school degree or less |
| pcol12 | Percent with 4-year college degree or more |
| punemp12 | Percent unemployed |
| pflabf12 | Percent female labor force participation |
| pprof12 | Percent professional employees |
| pmanuf12 | Percent manufacturing employees |
| pvet12 | Percent veteran |
| psemp12 | Percent self-employed |
| hinc12 | Median HH income, total |
| incpc12 | Per capita income |
| ppov12 | Percent in poverty, total |
| pown12 | Percent owner-occupied units |
| pvac12 | Percent vacant units |
| pmulti12 | Percent multi-family units |
| mrent12 | Median rent |
| mhmval12 | Median home value |
| p30old12 | Percent structures more than 30 years old |
| p10yrs12 | Percent HH in neighborhood 10 years or less |
| p18und12 | Percent 17 and under, total |
| p60up12 | Percent 60 and older, total |
| p75up12 | Percent 75 and older, total |
| pmar12 | Percent currently married, not separated |
| pwds12 | Percent widowed, divorced and separated |
| pfhh12 | Percent female-headed families with children |
Let’s make sure our data is loaded correctly by inspecting the first
few rows of the phx dataset.
# Inspect the data
head(phx)
## class : SpatialPolygonsDataFrame
## features : 6
## extent : -12551892, -12416427, 3958172, 4002664 (xmin, xmax, ymin, ymax)
## crs : +proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs
## variables : 44
## names : GEOID2, GEOID, STATEFP, COUNTYFP, TRACTCE, AFFGEOID, NAME, LSAD, ALAND, AWATER, POP, MHHI, ID, pop.w, pnhwht12, ...
## min values : 4013010101, 04013010101, 04, 013, 010101, 1400000US04013010101, 101.01, CT, 20520704, 13696, 4243, 40434, 65, 0.4243, 86.58999634, ...
## max values  :   4013040506, 04013040506,      04,     013,  040506, 1400000US04013040506, 405.06, CT,  6349981,   7599, 5404, 115725, 438, 0.5404, 99.19000244, ...

Before we can perform clustering, we need to prepare the data by selecting the variables we want to include in the analysis and scaling the data so that all variables are on the same scale.
Check that any missing values are handled; if there are missing values, you can remove them using the na.omit() function. This was a ChatGPT suggestion to ensure that we are working with clean data.

We are going to extract the data from the shapefile and save it as a separate data frame so we can use it for analysis independent of the shapefile. A sketch of both steps is shown below.
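The extraction code itself was not preserved in my write-up, so here is a minimal sketch of this step, assuming the shapefile object phx from above and the data frame name d1 that the later code expects:

# Pull the attribute table out of the shapefile so we can analyze it on its own
d1 <- as.data.frame( phx )

# Check for missing values before clustering (a ChatGPT suggestion)
sum( is.na( d1 ) )

# If any missing values turn up, they can be dropped:
# d1 <- na.omit( d1 )

# Preview the first few columns, formatted with pander
head( d1[ , 1:6 ] ) %>% pander()

The preview table below shows the first six columns of d1.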
| GEOID2 | GEOID | STATEFP | COUNTYFP | TRACTCE | AFFGEOID |
|---|---|---|---|---|---|
| 4.013e+09 | 04013010101 | 04 | 013 | 010101 | 1400000US04013010101 |
| 4.013e+09 | 04013010102 | 04 | 013 | 010102 | 1400000US04013010102 |
| 4.013e+09 | 04013030401 | 04 | 013 | 030401 | 1400000US04013030401 |
| 4.013e+09 | 04013030402 | 04 | 013 | 030402 | 1400000US04013030402 |
| 4.013e+09 | 04013040502 | 04 | 013 | 040502 | 1400000US04013040502 |
| 4.013e+09 | 04013040506 | 04 | 013 | 040506 | 1400000US04013040506 |
Next, we need to select the variables we want to include in the clustering analysis and scale the data so that all variables have a mean of 0 and a standard deviation of 1. This is important because clustering algorithms are sensitive to the scale of the variables, and scaling the data ensures that all variables are treated equally.

ChatGPT suggested that we use the scale() function to standardize the variables, but this code was provided in the original Lab 3.
keep.these <- c("pnhwht12", "pnhblk12", "phisp12", "pntv12", "pfb12", "polang12",
"phs12", "pcol12", "punemp12", "pflabf12", "pprof12", "pmanuf12",
"pvet12", "psemp12", "hinc12", "incpc12", "ppov12", "pown12",
"pvac12", "pmulti12", "mrent12", "mhmval12", "p30old12", "p10yrs12",
"p18und12", "p60up12", "p75up12", "pmar12", "pwds12", "pfhh12")
d2 <- select( d1, keep.these )
scaled_phx <- apply( d2, 2, scale )
head( scaled_phx[,1:6] ) %>% pander()

| pnhwht12 | pnhblk12 | phisp12 | pntv12 | pfb12 | polang12 |
|---|---|---|---|---|---|
| 1.382 | -0.8508 | -1.038 | -0.2962 | -0.6524 | -0.9802 |
| 1.358 | -0.8508 | -1.125 | -0.22 | -0.1142 | -0.6567 |
| 1.533 | -0.8508 | -1.153 | -0.2962 | -0.7297 | -1.201 |
| 1.24 | -0.8508 | -1.058 | -0.1715 | -0.8447 | -0.9941 |
| 1.054 | -0.6265 | -0.7904 | -0.1733 | -0.7813 | -0.8673 |
| 1.535 | -0.8472 | -1.168 | -0.2962 | -0.8854 | -1.101 |
We will be using the Elbow method to find the optimal k value for our k-means clustering algorithm. The Elbow method involves plotting the within-cluster sum of squares (WCSS) for different values of k and identifying the “elbow” point where the rate of decrease in WCSS slows down. This point represents the optimal number of clusters (k).
ChatGPT suggested this elbow method to help us determine the optimal number of clusters for our data. The code was given to me by ChatGPT; all I had to do was change the dataset name to match the one I was working with.
# Elbow method
wss <- (nrow(scaled_phx) - 1) * sum(apply(scaled_phx, 2, var))
for (i in 2:15) {
  wss[i] <- sum(kmeans(scaled_phx, centers = i)$withinss)
}
# Plot the elbow method
plot(1:15, wss, type = "b", main = "Elbow Method for Choosing k",
     xlab = "Number of Clusters", ylab = "Within-cluster Sum of Squares")

Based on the elbow plot, the decrease in WCSS leveled off around 14 clusters, so we will now fit the k-means clustering algorithm to the data using k = 14.
Not only did ChatGPT provide the code for fitting the clustering model, but it also suggested that we set the seed for reproducibility. This is good practice to ensure that the results are consistent across different runs of the algorithm. It also gave me the suggested comments, which I included alongside the code.
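The fitting step itself shows up again in the cluster profile section at the end, but here is a minimal sketch of the model fit as it happens at this point in the workflow, assuming the standardized matrix scaled_phx from above and k = 14 from the elbow plot:

# Set a seed so the cluster assignments are reproducible across runs
set.seed(123)

# Fit k-means with the 14 clusters suggested by the elbow plot
kmeans_result <- kmeans( scaled_phx, centers = 14 )

# How many census tracts landed in each cluster?
table( kmeans_result$cluster )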
Now that we have fit the clustering model to the data, we can add the cluster assignments to the original data so that we can explore the characteristics of each cluster.

This is where I got momentarily frustrated. ChatGPT gave me the code to create the cluster model, but I wanted to plot the clusters on a map, so I had to ask ChatGPT for help with this step as well. It worked perfectly. A sketch of how the cluster labels get attached to the map data is shown below.
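The step that attaches the cluster labels to the map data did not make it into my notes, so here is a minimal sketch, assuming the kmeans_result object from the fit above and that no tracts were dropped during cleaning (so the rows of scaled_phx line up with the polygons in phx):

# Attach the cluster assignments to the shapefile so tmap can color by them
phx$cluster <- as.factor( kmeans_result$cluster )

# Quick sanity check: every tract should now carry a cluster label
table( phx$cluster )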
Here’s the fun part: visualizing the clusters on our map. I was skeptical that this would work, but I used the code provided in the original Lab 3 and it worked beautifully.
# Clusters on the Map
tmap_mode("plot")
tmap_style("cobalt")
tm_shape( phx, bbox=bb ) +
tm_polygons( col="cluster", palette="Accent" ) +
tm_layout( "K-Means Clustering Results", title.position=c("right","top") )

When I asked ChatGPT for help with the visualization, it suggested that I use the factoextra package to enhance the visualization of the clustering results. While this did not create a beautiful map, I was able to use the package to create a more informative visualization of the clustering results. I can definitely see the value in using it for other projects. It also gave me an understanding of the overlap between the clusters, which made sense because there were many more clusters (14) in the K-Means model than the 8 clusters in the original Lab 3.
# Enhanced visualization with factoextra
fviz_cluster(kmeans_result, data = scaled_phx,
geom = "point", ellipse.type = "convex",
     main = "K-Means Clustering Visualization")

The clustering algorithm has grouped the census tracts into 14 clusters based on the selected demographic and socioeconomic variables. Each cluster represents a group of census tracts that are similar to each other in terms of these variables. You can now explore the characteristics of each cluster to identify patterns and trends in the data.
I loved how the previous Lab 3 provided these useful cluster profile groups, so I wanted to compare them with the K-Means model. I used the code provided in the original Lab 3 to create these cluster profiles; however, I ran into quite a bit of difficulty.

ChatGPT helped me work through the code to create the cluster profiles. It took a lot of trial and error, but in the end, ChatGPT and I figured it out.
Cluster Profiles
# Re-fit the model with the same seed so this chunk is self-contained,
# clustering on the standardized data (consistent with the fit above)
set.seed(123)
kmeans_result <- kmeans( scaled_phx, centers = 14 )

# Attach the cluster assignments, then group by 'cluster'
d2$cluster <- kmeans_result$cluster
stats <- d2 %>%
  group_by(cluster) %>%
  summarise( across( everything(), ~ mean(.x, na.rm = TRUE) ) )  # mean of every variable, by cluster

# Prepare data frame for plotting: one column of means per cluster group
t <- data.frame( t(stats), stringsAsFactors = FALSE )
names(t) <- paste0( "GROUP.", 1:14 )
t <- t[-1,]  # drop the 'cluster' row so only the 30 variables remain
for( i in 1:14 )
{
  z <- t[,i]   # mean values for cluster i, one per variable
  plot( rep(1,30), 1:30, bty="n", xlim=c(-75,100),
        type="n", xaxt="n", yaxt="n",
        xlab="Percentile", ylab="",
        main=paste("GROUP",i) )
  # shaded reference bands at the low, middle, and high ends of the scale
  rect( xleft=0, ybottom=0, xright=20, ytop=31, col=gray(0.75,0.5), border="gray80" )
  rect( xleft=40, ybottom=0, xright=60, ytop=31, col=gray(0.75,0.5), border="gray80" )
  rect( xleft=80, ybottom=0, xright=100, ytop=31, col=gray(0.75,0.5), border="gray80" )
  abline( v=seq(0,100,25), lty=3, lwd=1.5, col="gray90" )
  segments( y0=1:30, x0=0, x1=100, col="gray70", lwd=2 )
  # variable labels come from the Lab 3 data dictionary (the table above)
  text( -0.2, 1:30, data.dictionary$VARIABLE[-1], cex=0.85, pos=2 )
  points( z, 1:30, pch=19, col="firebrick", cex=1.5 )
  axis( side=1, at=c(0,50,100), col.axis="gray", col="gray" )
}
Overall, I found this Explainer Assignment to be a valuable learning experience, and I look forward to applying what I have learned to future projects. I also learned about K-Means Clustering and how it can be useful. In my own city data I only got 4 clusters, so I wonder whether the clusters would be more meaningful if this approach were applied to Seattle/Tacoma demographics. An A for ChatGPT. I was skeptical about its usefulness for coding because it has not been helpful for finding resources for research projects, but I can’t imagine not using it for all my coding questions in the future.