MUSA 500, Homework #5

#install.packages(c("NbClust","flexclust"))
library(NbClust)
library(flexclust)
library(dplyr)
library(kableExtra)
library(sf)
Cluster analysis helps us identify groups of similar observations in a dataset. In exploratory analysis, it is particularly helpful for partitioning data into smaller groups for further investigation.
In this assignment, we use k-means clustering, which is widely applied to larger datasets with numeric variables. Here, we examine the same dataset that we used in Homeworks #1 and #2. K-means partitions the dataset into a number \(k\) of non-overlapping clusters, with \(k\) predetermined by the researchers. Our dataset includes six variables: MEDHVAL, MEDHHINC, PCTBACHMOR, PCTSINGLES, PCTVACANT, and NBELPOV100. K-means clustering will help us identify whether there is any clustering in the aggregated effect of these variables. Furthermore, we will examine whether the resulting clusters are spatially autocorrelated.
By applying k-means clustering, we can answer questions like, How do we categorize the homeowners in question? How many classes are we able to identify from the data, and what are their characteristics? Incorporating spatial data, we can answer questions like, Are specific classes spatially clustered? Can we leverage our understanding of these clusters to create more equitable city policy?
First, researchers determine the intended number of clusters, \(k\). There are various methods for selecting a good value of \(k\); in R, the NbClust package helps with this choice. In this project, we generate a scree plot of the within-group sum of squared errors (SSE) for a range of values of \(k\), and use NbClust's battery of criteria to suggest an appropriate value of \(k\).
Next, we iteratively improve the cluster assignments until there is no change in cluster membership (i.e., each data point stays assigned to the same cluster center), or until a preset number of iterations is reached. The process, sketched in the code following the list below, involves:
1. Identify \(k\) data points as centers for each cluster (usually at random).
2. Assign each data point to a cluster by calculating its Euclidean distance to each cluster center and choosing the closest center.
3. Given the assigned clusters, recalculate the center of each cluster, then repeat from step 2. If no assignments change, or if the preset number of iterations has been reached, stop.
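To make these steps concrete, here is a minimal, illustrative sketch of the assignment and update steps on toy data. This is not the code used in our analysis (we rely on R's built-in kmeans(), shown later); it is only meant to show what the algorithm does internally.

```r
# Toy sketch of the k-means loop described above
set.seed(1)
X <- scale(matrix(rnorm(200), ncol = 2))            # 100 toy points, 2 standardized variables
k <- 3
centers <- X[sample(nrow(X), k), , drop = FALSE]    # step 1: pick k points as initial centers

for (iter in 1:100) {
  # step 2: assign each point to its nearest center (Euclidean distance)
  d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
  assignment <- apply(d, 1, which.min)
  # step 3: recompute each center as the mean of its assigned points
  # (a robust implementation would also handle empty clusters)
  new_centers <- t(sapply(1:k, function(j) colMeans(X[assignment == j, , drop = FALSE])))
  if (all(abs(new_centers - centers) < 1e-8)) break  # stop once the centers no longer move
  centers <- new_centers
}
```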
In this assignment, since we are clustering on several variables measured on very different scales (dollar amounts, percentages, and counts), we standardize all of them as z-scores so that they operate on the same scale before proceeding to k-means clustering and analysis.
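As a quick illustration (not part of the homework code), R's scale() performs exactly this z-score standardization, subtracting each column's mean and dividing by its standard deviation; the values below are hypothetical:

```r
# scale() converts each column to z-scores: (x - mean(x)) / sd(x)
x <- c(100000, 250000, 400000)   # e.g., three hypothetical home values
as.numeric(scale(x))             # -1, 0, 1
(x - mean(x)) / sd(x)            # identical result
```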
Other clustering methods include hierarchical clustering and density-based clustering (DBSCAN). In comparison to these two, k-means appears to be the most appropriate choice in our context. Hierarchical clustering is better suited to smaller datasets, ideally under 100 observations; since this dataset contains 1,720 observations, it is too big for hierarchical clustering. DBSCAN does not require specifying the number of clusters in advance, but it is sensitive to its density parameters and can label many observations as noise, so it offers no particular advantage here.
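For comparison only (these are not part of our analysis), the two alternatives would look roughly like this in R, applied to the scaled data matrix created in the results code below. The dbscan package, and the eps and minPts values shown, are assumptions and placeholders, not tuned choices:

```r
# Hierarchical clustering (stats::hclust): feasible only for smaller datasets,
# since it requires the full pairwise distance matrix
hc <- hclust(dist(data), method = "ward.D2")
hc_groups <- cutree(hc, k = 3)     # cut the dendrogram into 3 groups

# DBSCAN (dbscan package): no k to choose, but eps and minPts must be tuned,
# and points in low-density areas are labeled 0 (noise)
# install.packages("dbscan")
db <- dbscan::dbscan(data, eps = 1, minPts = 10)
table(db$cluster)
```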
The first scree plot shows the within-group SSE for each number of clusters from 1 to 20. In the graph below, there is a distinct drop in SSE as the number of clusters increases from 1 to 3. Beyond three or four clusters, the rate of decrease levels off, suggesting that a 3-cluster solution is likely to be a good option for our data.
Given NbClust’s ability to apply many different criteria when seeking the ideal number of clusters, we have also plotted the number of clusters suggested by 26 criteria. From this plot, we can see that 2 and 3 clusters are suggested most frequently; since a 2-cluster solution would be too coarse for further analysis, our choice of 3 clusters seems appropriate.
# read the block group data used in Homeworks #1 and #2
regression <- read.csv("https://github.com/nlebovits/musa-500-hmwk-5/raw/main/data/RegressionData.csv")
# drop the ID columns and standardize the remaining variables as z-scores
data <- regression %>% select(-POLY_ID, -AREAKEY)
data <- scale(data)
# total sum of squares (the SSE of a 1-cluster solution)
data_see <- (nrow(data)-1)*sum(apply(data,2,var))
#nrow(data)-1
#apply(data,2,var)
# preview a 2-cluster solution
kmeans(data, centers=2)
# within-group SSE for k = 2 through 20 clusters
for (i in 2:20) data_see[i] <- sum(kmeans(data, centers=i)$withinss)
#data_see
plot(1:20, data_see, type="b", xlab="Number of Clusters",
     ylab="Within-group sum of squared errors")
set.seed(100)
nc <- NbClust(data, min.nc=2, max.nc=15, method="kmeans", index="all")
#table(nc$Best.n[1,])
par(mfrow=c(1,1))
barplot(table(nc$Best.n[1,]),
        xlab="Number of Clusters", ylab="Number of Criteria",
        main="Number of Clusters Chosen by 26 Criteria")
set.seed(1234)
fit.km <- kmeans(data, 3, nstart=1000)   # 3-cluster solution; 1000 random starts to avoid poor local optima
fit.km$size
round(fit.km$centers, 2)
fit.km$cluster
cluster_info <- data.frame(cluster = fit.km$cluster)   # cluster assignment for each block group
table <- cbind(round(aggregate(data, by=list(cluster=fit.km$cluster), mean),1),fit.km$size)
colnames(table)[1] = c("CLUSTER")
colnames(table)[8] = c("SIZE")
table <- table[, c(1, 8, 2, 3, 4, 5, 6, 7)]
table %>%
kbl() %>%
kable_paper("hover", full_width = F)

| CLUSTER | SIZE | MEDHVAL | PCTBACHMOR | MEDHHINC | PCTVACANT | PCTSINGLES | NBELPOV100 |
|---|---|---|---|---|---|---|---|
| 1 | 651 | -0.5 | -0.5 | -0.7 | 0.8 | -0.2 | 0.7 |
| 2 | 159 | 2.0 | 2.2 | 1.6 | -0.7 | 1.5 | -0.6 |
| 3 | 910 | 0.0 | 0.0 | 0.2 | -0.5 | -0.1 | -0.4 |
Overall, the clusters make sense across the six variables. Group 1 has the lowest median house value, percentage of residents with bachelor's degrees, median household income, and percentage of single-family houses, as well as the highest percentage of vacant housing and the highest poverty levels of all the cluster groups. Group 2 is the opposite, and Group 3 falls in between. The clustering reflects the socioeconomic characteristics of each group; taking everything into consideration, we gave each group a descriptive name reflecting these characteristics.
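Because the table above reports means of the standardized (z-scored) variables, it can also help to look at cluster means on the variables' original scales when naming the groups. A possible sketch follows; the variable names are taken from the table above and assumed to match the column names in the raw CSV:

```r
# Cluster means on the original (unstandardized) scale, to aid naming the groups
vars <- c("MEDHVAL", "PCTBACHMOR", "MEDHHINC", "PCTVACANT", "PCTSINGLES", "NBELPOV100")
round(aggregate(regression[, vars],
                by = list(cluster = fit.km$cluster), FUN = mean), 1)
```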
Regression_cluster <- cbind(regression, cluster_info)
regression_shp <- st_read('C:/Users/Nissim/Desktop/Fall 2022/Spat Stats/Homeworks/ass_1_data_shp/RegressionData.shp')
#regression_shp <- st_read("/Users/annzhang/Downloads/HW 5/RegressionData.shp")
regression_shp <- left_join(regression_shp, Regression_cluster, by = "POLY_ID") %>% st_as_sf()
regression_cluster <- regression_shp %>% dplyr::select(cluster)   # use the joined cluster column so assignments stay matched to the correct polygons
plot(regression_cluster)

By plotting the three groups on a map (as shown above), we can see some spatial autocorrelation. There appears to be spatial clustering of Group 2 (higher-income, more educated communities) in parts of Center City, the northwestern part of the city, and parts of the Northeast, and clustering of Group 1 (lower-income communities) in the area just north of Center City and in parts of West Philadelphia. Since these areas are not geographically concentrated, it is difficult to rename the clusters by geographic location.
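To test this visual impression formally (an extension beyond the mapping done here), the spatial autocorrelation of cluster membership could be examined with a join count test from the spdep package, using a queen contiguity matrix built from the block group polygons. This is only a sketch, and it assumes the regression_shp object created above with its joined cluster column:

```r
# Join count test: are same-cluster block groups adjacent more often than expected by chance?
# install.packages("spdep")
library(spdep)
nb <- poly2nb(regression_shp, queen = TRUE)          # queen contiguity neighbors
lw <- nb2listw(nb, style = "B", zero.policy = TRUE)  # binary spatial weights
joincount.test(as.factor(regression_shp$cluster), lw, zero.policy = TRUE)
```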
To sum up, we used k-means clustering on six variables and created three cluster groups representing three socioeconomic profiles and their respective housing values. The clustering does a relatively good job of separating the data into interpretable groups. Although none of the findings are surprising, the map in the results section, which shows the spatial distribution of the cluster groups we created, suggests some spatial autocorrelation.