In Lab 3, we were presented with the idea of clustering our neighborhoods into meaningful groups based on the demographic and socioeconomic characteristics of the residents. We used the provided clustering algorithm to group census tracts in Phoenix, Arizona, into clusters based on various demographic and socioeconomic variables. The goal was to identify patterns and trends in the data that could help us better understand the diversity and disparities within the city. In addition, we visualized the clustering results on a map to see how the different clusters are distributed geographically. In this Explainer Assignment, I will introduce you to another way that data can be clustered using the K-Means Clustering algorithm in R.
I had virtually no experience with K-Means Clustering before this lab, so I was excited to learn more about how it works and how it can be applied to real-world data. I found this assignment informative and engaging, and I was able to follow the instructions, run the code, and generate the clustering results for the Phoenix census tracts with minimal frustration, thanks to ChatGPT. Anytime I ran into coding difficulty, ChatGPT offered multiple solutions to try to correct my mistakes. Overall, I found this assignment to be a valuable learning experience, and I look forward to applying what I have learned to future projects.
I asked ChatGPT to help me write instructions on how to use the K-Means Clustering algorithm in R. Here is what it came up with. The code below is a combination of the code provided in the original Lab 3 and code generated by ChatGPT.
First, we need to load the following libraries. These should all be familiar from previous labs, with the exception of ‘factoextra’.

ChatGPT helped me write the comments explaining each package and identify the libraries required.
# Load necessary libraries
library(ggplot2) # For data visualization
library(dplyr) # For data manipulation
library(cluster) # For clustering algorithms
library(factoextra) # For visualizing clustering results
library( geojsonio ) # read shapefiles
library( sp ) # work with shapefiles
library( sf ) # work with shapefiles - simple features format
library( mclust ) # cluster analysis
library( tmap ) # theme maps
library( ggthemes ) # extra ggplot2 themes
library( pander ) # table formatting

This exercise uses Census data from the 2012 American Community Survey, made available through the Diversity and Disparities Project. We will be using the acs12 dataset, which contains information on various demographic and socioeconomic variables at the census tract level. Specifically, we will be loading a Phoenix Dorling shapefile and the acs12 data to perform a cluster analysis. This is the same dataset that we used for our Lab 3 clustering analysis.
# Load the Phoenix Census Tracts Dorling Cartogram
github.url <- "https://raw.githubusercontent.com/DS4PS/cpp-529-master/master/data/phx_dorling.geojson"
phx <- geojson_read( x=github.url, what="sp" )
plot( phx )Now, let’s see how we can bring this Pheonix cartogram to life, focusing on median household income and the tmap package. Again, this code was provided in our original lab 3.
# library( tmap )
phx <- spTransform( phx, CRS("+init=epsg:3395") )
bb <- st_bbox( c( xmin = -12519146, xmax = -12421368,
ymax = 3965924, ymin = 3899074 ),
crs = st_crs("+init=epsg:3395"))
tm_shape( phx, bbox=bb ) +
tm_polygons( col="MHHI", n=10, style="quantile", palette="Spectral" ) +
tm_layout( "PHX Dorling Cartogram", title.position=c("right","top") )Here is the provided data library so we can better understand which census variables we are considering in our cluster analysis:
| LABEL | VARIABLE |
|---|---|
| tractid | GEOID |
| pnhwht12 | Percent white, non-Hispanic |
| pnhblk12 | Percent black, non-Hispanic |
| phisp12 | Percent Hispanic |
| pntv12 | Percent Native American race |
| pfb12 | Percent foreign born |
| polang12 | Percent speaking other language at home, age 5 plus |
| phs12 | Percent with high school degree or less |
| pcol12 | Percent with 4-year college degree or more |
| punemp12 | Percent unemployed |
| pflabf12 | Percent female labor force participation |
| pprof12 | Percent professional employees |
| pmanuf12 | Percent manufacturing employees |
| pvet12 | Percent veteran |
| psemp12 | Percent self-employed |
| hinc12 | Median HH income, total |
| incpc12 | Per capita income |
| ppov12 | Percent in poverty, total |
| pown12 | Percent owner-occupied units |
| pvac12 | Percent vacant units |
| pmulti12 | Percent multi-family units |
| mrent12 | Median rent |
| mhmval12 | Median home value |
| p30old12 | Percent structures more than 30 years old |
| p10yrs12 | Percent HH in neighborhood 10 years or less |
| p18und12 | Percent 17 and under, total |
| p60up12 | Percent 60 and older, total |
| p75up12 | Percent 75 and older, total |
| pmar12 | Percent currently married, not separated |
| pwds12 | Percent widowed, divorced and separated |
| pfhh12 | Percent female-headed families with children |
Let’s make sure our data is loaded correctly by inspecting the first
few rows of the phx dataset.
# Inspect the data
head(phx)
## class : SpatialPolygonsDataFrame
## features : 6
## extent : -12551892, -12416427, 3958172, 4002664 (xmin, xmax, ymin, ymax)
## crs : +proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs
## variables : 44
## names : GEOID2, GEOID, STATEFP, COUNTYFP, TRACTCE, AFFGEOID, NAME, LSAD, ALAND, AWATER, POP, MHHI, ID, pop.w, pnhwht12, ...
## min values : 4013010101, 04013010101, 04, 013, 010101, 1400000US04013010101, 101.01, CT, 20520704, 13696, 4243, 40434, 65, 0.4243, 86.58999634, ...
## max values  :   4013040506, 04013040506,      04,     013,  040506, 1400000US04013040506, 405.06, CT,  6349981,   7599, 5404, 115725, 438, 0.5404, 99.19000244, ...

Before we can perform clustering, we need to prepare the data by selecting the variables we want to include in the analysis and scaling the data so that all variables are on the same scale.
Check that any missing values are handled; if there are missing values, you can remove them using the na.omit() function. This was a ChatGPT suggestion to ensure that we are working with clean data.

We are going to extract the data from the shapefile and save it as a separate data frame so we can use it for analysis independent of the shapefile. A sketch of both steps is shown below.
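The extraction code itself was not preserved in my write-up, so here is a minimal sketch of this step, assuming the shapefile object phx from above and the data frame name d1 that the later code expects:

# Pull the attribute table out of the shapefile so we can analyze it on its own
d1 <- as.data.frame( phx )

# Check for missing values before clustering (a ChatGPT suggestion)
sum( is.na( d1 ) )

# If any missing values turn up, they can be dropped:
# d1 <- na.omit( d1 )

# Preview the first few columns, formatted with pander
head( d1[ , 1:6 ] ) %>% pander()

The preview table below shows the first six columns of d1.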
| GEOID2 | GEOID | STATEFP | COUNTYFP | TRACTCE | AFFGEOID |
|---|---|---|---|---|---|
| 4.013e+09 | 04013010101 | 04 | 013 | 010101 | 1400000US04013010101 |
| 4.013e+09 | 04013010102 | 04 | 013 | 010102 | 1400000US04013010102 |
| 4.013e+09 | 04013030401 | 04 | 013 | 030401 | 1400000US04013030401 |
| 4.013e+09 | 04013030402 | 04 | 013 | 030402 | 1400000US04013030402 |
| 4.013e+09 | 04013040502 | 04 | 013 | 040502 | 1400000US04013040502 |
| 4.013e+09 | 04013040506 | 04 | 013 | 040506 | 1400000US04013040506 |
Next, we need to select the variables we want to include in the clustering analysis and scale the data so that all variables have a mean of 0 and a standard deviation of 1. This is important because clustering algorithms are sensitive to the scale of the variables, and scaling the data ensures that all variables are treated equally.

ChatGPT suggested that we use the scale() function to standardize the variables, but this code was provided in the original Lab 3.
keep.these <- c("pnhwht12", "pnhblk12", "phisp12", "pntv12", "pfb12", "polang12",
"phs12", "pcol12", "punemp12", "pflabf12", "pprof12", "pmanuf12",
"pvet12", "psemp12", "hinc12", "incpc12", "ppov12", "pown12",
"pvac12", "pmulti12", "mrent12", "mhmval12", "p30old12", "p10yrs12",
"p18und12", "p60up12", "p75up12", "pmar12", "pwds12", "pfhh12")
d2 <- select( d1, keep.these )
scaled_phx <- apply( d2, 2, scale )
head( scaled_phx[,1:6] ) %>% pander()

| pnhwht12 | pnhblk12 | phisp12 | pntv12 | pfb12 | polang12 |
|---|---|---|---|---|---|
| 1.382 | -0.8508 | -1.038 | -0.2962 | -0.6524 | -0.9802 |
| 1.358 | -0.8508 | -1.125 | -0.22 | -0.1142 | -0.6567 |
| 1.533 | -0.8508 | -1.153 | -0.2962 | -0.7297 | -1.201 |
| 1.24 | -0.8508 | -1.058 | -0.1715 | -0.8447 | -0.9941 |
| 1.054 | -0.6265 | -0.7904 | -0.1733 | -0.7813 | -0.8673 |
| 1.535 | -0.8472 | -1.168 | -0.2962 | -0.8854 | -1.101 |
We will be using the Elbow method to find the optimal k value for our k-means clustering algorithm. The Elbow method involves plotting the within-cluster sum of squares (WCSS) for different values of k and identifying the “elbow” point where the rate of decrease in WCSS slows down. This point represents the optimal number of clusters (k).
ChatGPT suggested this elbow method to help us determine the optimal number of clusters for our data. The code was given to me by ChatGPT; all I had to do was change the dataset name to match the one I was working with.
# Elbow method
wss <- (nrow(scaled_phx) - 1) * sum(apply(scaled_phx, 2, var))
for (i in 2:15) {
  wss[i] <- sum(kmeans(scaled_phx, centers = i)$withinss)
}
# Plot the elbow method
plot(1:15, wss, type = "b", main = "Elbow Method for Choosing k",
     xlab = "Number of Clusters", ylab = "Within-cluster Sum of Squares")

Based on the elbow plot, the decrease in WCSS leveled off around 14 clusters, so we will now fit the k-means clustering algorithm to the data using k = 14.
Not only did ChatGPT provide the code for fitting the clustering model, but it also suggested that we set the seed for reproducibility. This is good practice to ensure that the results are consistent across different runs of the algorithm. It also gave me the suggested comments, which I included alongside the code.
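The fitting step itself shows up again in the cluster profile section at the end, but here is a minimal sketch of the model fit as it happens at this point in the workflow, assuming the standardized matrix scaled_phx from above and k = 14 from the elbow plot:

# Set a seed so the cluster assignments are reproducible across runs
set.seed(123)

# Fit k-means with the 14 clusters suggested by the elbow plot
kmeans_result <- kmeans( scaled_phx, centers = 14 )

# How many census tracts landed in each cluster?
table( kmeans_result$cluster )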
Now that we have fit the clustering model to the data, we can add the cluster assignments to the original data so that we can explore the characteristics of each cluster.

This is where I got momentarily frustrated. ChatGPT gave me the code to create the cluster model, but I wanted to plot the clusters on a map, so I had to ask ChatGPT for help with this step as well. It worked perfectly. A sketch of how the cluster labels get attached to the map data is shown below.
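The step that attaches the cluster labels to the map data did not make it into my notes, so here is a minimal sketch, assuming the kmeans_result object from the fit above and that no tracts were dropped during cleaning (so the rows of scaled_phx line up with the polygons in phx):

# Attach the cluster assignments to the shapefile so tmap can color by them
phx$cluster <- as.factor( kmeans_result$cluster )

# Quick sanity check: every tract should now carry a cluster label
table( phx$cluster )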
Here’s the fun part: visualizing the clusters on our map. I was skeptical that this would work, but I used the code provided in the original Lab 3 and it worked beautifully.
# Clusters on the Map
tmap_mode("plot")
tmap_style("cobalt")
tm_shape( phx, bbox=bb ) +
tm_polygons( col="cluster", palette="Accent" ) +
tm_layout( "K-Means Clustering Results", title.position=c("right","top") )

When I asked ChatGPT for help with the visualization, it suggested that I use the factoextra package to enhance the visualization of the clustering results. While this did not create a beautiful map, I was able to use the package to create a more informative visualization of the clustering results. I can definitely see the value in using it for other projects. It also gave me an understanding of the overlap between the clusters, which made sense because there were many more clusters (14) in the K-Means model than the 8 clusters in the original Lab 3.
# Enhanced visualization with factoextra
fviz_cluster(kmeans_result, data = scaled_phx,
geom = "point", ellipse.type = "convex",
     main = "K-Means Clustering Visualization")

The clustering algorithm has grouped the census tracts into 14 clusters based on the selected demographic and socioeconomic variables. Each cluster represents a group of census tracts that are similar to each other in terms of these variables. You can now explore the characteristics of each cluster to identify patterns and trends in the data.
I loved how the previous Lab 3 provided these useful cluster profile groups, so I wanted to compare them with the K-Means model. I used the code provided in the original Lab 3 to create these cluster profiles; however, I ran into quite a bit of difficulty.

ChatGPT helped me work through the code to create the cluster profiles. It took a lot of trial and error, but in the end, ChatGPT and I figured it out.
Cluster Profiles
# Re-fit the model with the same seed so this chunk is self-contained,
# clustering on the standardized data (consistent with the fit above)
set.seed(123)
kmeans_result <- kmeans( scaled_phx, centers = 14 )

# Attach the cluster assignments, then group by 'cluster'
d2$cluster <- kmeans_result$cluster
stats <- d2 %>%
  group_by(cluster) %>%
  summarise( across( everything(), ~ mean(.x, na.rm = TRUE) ) )  # mean of every variable, by cluster

# Prepare data frame for plotting: one column of means per cluster group
t <- data.frame( t(stats), stringsAsFactors = FALSE )
names(t) <- paste0( "GROUP.", 1:14 )
t <- t[-1,]  # drop the 'cluster' row so only the 30 variables remain
for( i in 1:14 )
{
  z <- t[,i]   # mean values for cluster i, one per variable
  plot( rep(1,30), 1:30, bty="n", xlim=c(-75,100),
        type="n", xaxt="n", yaxt="n",
        xlab="Percentile", ylab="",
        main=paste("GROUP",i) )
  # shaded reference bands at the low, middle, and high ends of the scale
  rect( xleft=0, ybottom=0, xright=20, ytop=31, col=gray(0.75,0.5), border="gray80" )
  rect( xleft=40, ybottom=0, xright=60, ytop=31, col=gray(0.75,0.5), border="gray80" )
  rect( xleft=80, ybottom=0, xright=100, ytop=31, col=gray(0.75,0.5), border="gray80" )
  abline( v=seq(0,100,25), lty=3, lwd=1.5, col="gray90" )
  segments( y0=1:30, x0=0, x1=100, col="gray70", lwd=2 )
  # variable labels come from the Lab 3 data dictionary (the table above)
  text( -0.2, 1:30, data.dictionary$VARIABLE[-1], cex=0.85, pos=2 )
  points( z, 1:30, pch=19, col="firebrick", cex=1.5 )
  axis( side=1, at=c(0,50,100), col.axis="gray", col="gray" )
}
Overall, I found this Explainer Assignment to be a valuable learning experience, and I look forward to applying what I have learned to future projects. I also learned about K-Means Clustering and how it can be useful. In my own city data I only got 4 clusters, so I wonder whether the clusters would be more meaningful if this approach were applied to Seattle/Tacoma demographics. An A for ChatGPT. I was skeptical about its usefulness for coding because it has not been helpful for finding resources for research projects, but I can’t imagine not using it for all my coding questions in the future.