The data set contains information about house properties and their prices. It includes 1,047 entries, each representing a property. The columns provide key attributes of these houses and their corresponding prices. Cluster analysis on this dataset can uncover meaningful groupings of houses based on features like living area, bathrooms, bedrooms, lot size, age, and fireplaces, offering insights into property segmentation and market trends.
For example, clusters could differentiate luxury properties with large living spaces and premium features from affordable homes with fewer amenities or older constructions. Price segmentation could highlight what differentiates houses at various price points, such as modern amenities or larger lot sizes. Insights into buyer profiles, such as families preferring larger homes or investors targeting older, affordable properties, can guide marketing strategies. Ultimately, clustering helps identify the characteristics driving housing markets and informs targeted decisions.
Variable Descriptions:
As our data has already arrived pre-cleaned, we will take a quick look at the statistical spread of the columns to see what we are working with.
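The summary shown below is not accompanied by its code chunk; a minimal sketch of how it could be produced, assuming the cleaned data is already loaded as a data frame named `houseprice` and that the packages used throughout this report are loaded up front:

# Packages assumed to be loaded in a setup chunk (not echoed in the report)
library(ggplot2)     # plotting
library(reshape2)    # melt() for long-format reshaping
library(factoextra)  # PCA and clustering visualizations (fviz_*)
library(cluster)     # pam(), silhouette()
library(gridExtra)   # grid.arrange()
library(hopkins)     # Hopkins clustering-tendency statistic

summary(houseprice)  # five-number summary plus mean for every column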
## Living Area Bathrooms Bedrooms Lot Size
## Min. :0.672 Min. :1.000 Min. :1.000 Min. :0.0000
## 1st Qu.:1.336 1st Qu.:1.500 1st Qu.:3.000 1st Qu.:0.2100
## Median :1.672 Median :2.000 Median :3.000 Median :0.3900
## Mean :1.807 Mean :1.918 Mean :3.183 Mean :0.5696
## 3rd Qu.:2.206 3rd Qu.:2.500 3rd Qu.:4.000 3rd Qu.:0.6000
## Max. :4.534 Max. :4.500 Max. :6.000 Max. :9.0000
## Age Fireplace Price
## Min. : 0.00 Min. :0.0000 Min. : 1.686
## 1st Qu.: 6.00 1st Qu.:0.0000 1st Qu.:11.201
## Median : 18.00 Median :1.0000 Median :15.192
## Mean : 28.06 Mean :0.5931 Mean :16.386
## 3rd Qu.: 34.00 3rd Qu.:1.0000 3rd Qu.:20.523
## Max. :247.00 Max. :1.0000 Max. :44.644
Upon initial inspection, all variables appear to be properly formatted with no apparent mis-inputs or missing values. One thing worth noting is the rather large spread between the 3rd quartile and maximum values for some of the columns in our data set, such as Age, Lot Size, and Price.
To observe these, we can make a quick box plot of our scaled data (scaling minimizes differences between measurement units) and see whether there are any extreme outliers that may need adjusting.
# Scale all columns to z-scores, then reshape to long format for ggplot
hp_scale <- scale(houseprice)
hp_scale_long <- melt(hp_scale)
ggplot(hp_scale_long, aes(x = Var2, y = value)) +
  geom_boxplot(aes(fill = Var2)) +
  theme_minimal() +
  labs(
    title = "Boxplot of All Columns in Data-set",
    x = "Variables",
    y = "Values"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

As can be immediately observed, there is a rather striking number of outliers in our Age and Lot Size columns. These outliers could have adverse effects on the process and findings of our cluster analysis, so we will adjust accordingly by taking the log of these columns.
# Replace non-positive values with 0.1 so the log transform is well-defined
houseprice$`Lot Size` <- ifelse(houseprice$`Lot Size` <= 0, 0.1, houseprice$`Lot Size`)
houseprice$Age <- ifelse(houseprice$Age <= 0, 0.1, houseprice$Age)
houseprice$`Lot Size` <- log(houseprice$`Lot Size`)
houseprice$Age <- log(houseprice$Age)

# Re-scale and re-plot after the log transformation
hp_scale <- scale(houseprice)
hp_scale_long <- melt(hp_scale)
ggplot(hp_scale_long, aes(x = Var2, y = value)) +
  geom_boxplot(aes(fill = Var2)) +
  theme_minimal() +
  labs(
    title = "Boxplot of All Columns in Data-set",
    x = "Variables",
    y = "Values"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This is a much better standardized version of our data. With these transformations in place, we can get started with our cluster analysis.
We will be mapping our data onto a dimensionally reduced plane, so it is essential that we have a legitimate PCA and a strong understanding of our principal components before moving forward with any analysis:
pca_result = prcomp(hp_scale)

# Scree plot of variance explained by each component
fviz_eig(pca_result, addlabels = TRUE, main = "Explained Variance by PCA Components") +
  geom_bar(stat = "identity", fill = "#898A84", color = "black") +
  geom_line(aes(group = 1), color = "black") +
  theme_minimal()

# Variable contributions to the first two components
p1 = fviz_contrib(pca_result, choice = "var", axes = 1) +
  ggtitle("Contributions to PC1") +
  geom_bar(stat = "identity", fill = "#898A84", color = "black") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) +
  labs(x = "Column Name")
p2 = fviz_contrib(pca_result, choice = "var", axes = 2) +
  ggtitle("Contributions to PC2") +
  geom_bar(stat = "identity", fill = "#898A84", color = "black") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) +
  labs(x = "Column Name")
grid.arrange(p1, p2, ncol = 2)

PC1 explains 52.2% of the total variance, making it the most significant component for understanding the dataset.
PC1 and PC2 together explain 66.5% of the variance (52.2% + 14.3%). This indicates that most of the dataset’s variability can be captured using the first two components, suggesting dimensionality reduction is effective here.
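These percentages can be read directly from the PCA object; a quick sketch (the original chunk is not echoed) using factoextra's eigenvalue table or base R:

get_eigenvalue(pca_result)       # eigenvalue, % variance, cumulative % per PC
summary(pca_result)$importance   # same information from base R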
PC1 Contributions:
The top contributors to PC1 are Living Area, Price, and Bathrooms, with all three contributing over 15%. These variables likely represent aspects of size and cost, indicating PC1 captures variance related to property scale or value.
PC2 Contributions:
The primary contributor to PC2 is Lot Size, contributing over 60%, followed by Age and Bedrooms. This suggests PC2 captures variance related to spatial characteristics and property maturity, distinguishing these features from the overall size/cost captured by PC1.
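The contribution bars above are derived from the squared, scaled loadings; for readers who want the raw numbers, a small sketch to pull them out of the same objects:

round(get_pca_var(pca_result)$contrib[, 1:2], 1)  # % contribution of each variable to PC1/PC2
round(pca_result$rotation[, 1:2], 2)              # underlying loadings (rotation matrix)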
We can move forward and expand on these principal components with a factor analysis to identify what latent factors could underlie them.
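The factor-analysis output below echoes its own call; for completeness, a minimal sketch of the chunk that would generate it:

# Three-factor model, varimax rotation, Bartlett factor scores
fa_result <- factanal(x = houseprice, factors = 3, scores = "Bartlett", rotation = "varimax")
fa_result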
##
## Call:
## factanal(x = houseprice, factors = 3, scores = "Bartlett", rotation = "varimax")
##
## Uniquenesses:
## Living Area Bathrooms Bedrooms Lot Size Age Fireplace
## 0.139 0.017 0.247 0.866 0.662 0.715
## Price
## 0.231
##
## Loadings:
## Factor1 Factor2 Factor3
## Living Area 0.419 0.604 0.567
## Bathrooms 0.899 0.292 0.298
## Bedrooms 0.240 0.826 0.115
## Lot Size 0.321 0.170
## Age -0.438 -0.375
## Fireplace 0.294 0.226 0.384
## Price 0.399 0.358 0.695
##
## Factor1 Factor2 Factor3
## SS loadings 1.482 1.419 1.223
## Proportion Var 0.212 0.203 0.175
## Cumulative Var 0.212 0.414 0.589
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 1.62 on 3 degrees of freedom.
## The p-value is 0.654
Loadings measure how strongly each variable is associated with the three factors:
Factor 1 (Property Quality): Strongly associated with Bathrooms (0.899) and moderately with Price (0.399) and Living Area (0.419). Likely reflects property quality or luxury features.
Factor 2 (Spatial Characteristics): Strongly associated with Bedrooms (0.826) and moderately with Living Area (0.604). Likely represents the spatial capacity of the property.
Factor 3 (Property Value): Strongly associated with Price (0.695) and moderately with Living Area (0.567). Likely reflects monetary value or market appeal.
Our factor analysis reflects similar findings to the assumptions we made from the PC contributions. For the sake of interpretability, we will continue this analysis using the first two principal components to graph our data.
We can continue forward with plotting our PCA to get an idea of the spread of our data before organizing it into clusters.
# Biplot: individuals shown as points, variable arrows colored by contribution
fviz_pca_biplot(pca_result,
                ind.var = 'cos2',
                col.var = "contrib",
                gradient.cols = c(low = "blue", high = "red"),
                repel = TRUE,      # Avoid overlapping labels
                geom = "point",    # Display individuals as points
                labelsize = 3
)

# Individuals colored by Living Area to see how it maps onto the PC plane
fviz_pca_ind(pca_result, title = "PCA - Spread",
             geom = "point",
             ggtheme = theme_classic(),
             legend = "bottom",
             col.ind = houseprice$`Living Area`) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(color = "Living Area")

In this section we will first analyze our data with two cluster analysis methods known as K-Means and K-Medoids.
K-means uses the mean (average) of data points in a cluster as the cluster center, minimizing the sum of squared distances between points and their cluster center. It is computationally efficient but sensitive to outliers, as the mean can be heavily influenced by extreme values.
K-medoids, on the other hand, uses an actual data point (called a medoid) as the cluster center, minimizing the sum of absolute distances, making it more robust to outliers.
With both of these methods we first have to tell our program how many centroids to initiate the analysis with, so we want to ensure we are working with an optimal number of clusters.
Before that, we must assess whether the data is even suitable for clustering, using the Hopkins test.
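The statistic printed below is not accompanied by its chunk; a minimal sketch using the hopkins package (the seed is an assumption, since the statistic is based on random sampling):

set.seed(123)              # assumed; hopkins() samples points at random
hopkins::hopkins(hp_scale)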
## [1] 0.9947356
Using the hopkins library: calculated values from 0 to 0.3 indicate regularly-spaced data, values around 0.5 indicate random data, and values from 0.7 to 1 indicate clustered data.
Our Hopkins test revealed a score of roughly 0.99, giving us strong evidence that the data is clustered. With this established, we can move forward with selecting the proper number of clusters.
# Elbow (within-cluster sum of squares) plot for choosing k
fviz_nbclust(hp_scale, kmeans, method = "wss", nstart = 200) +
  geom_vline(xintercept = 4, linetype = 2)

With this visualization we observe it is in our best interest to continue forward with 4-5 clusters, and for the sake of visual simplicity in future plots we will work with 4.
With that said, let us separate our data points into their respective 4 clusters and analyze.
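The cluster sizes and per-cluster means printed below could be produced with something like the following sketch (the seed and `nstart` value are assumptions, chosen to mirror the elbow-plot call above):

set.seed(123)                                    # assumed seed for reproducibility
datos.km <- kmeans(hp_scale, centers = 4, nstart = 200)
datos.km$size                                    # number of observations per cluster
aggregate(houseprice, by = list(cluster = datos.km$cluster), FUN = mean)  # cluster profiles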
## [1] 226 297 358 166
## cluster Living Area Bathrooms Bedrooms Lot Size Age Fireplace
## 1 1 1.502673 1.550885 2.769912 -1.2588238 3.1973317 1.0000000
## 2 2 2.088135 2.380471 3.528620 -0.6811954 2.1885889 0.8047138
## 3 3 1.303774 1.418994 2.812849 -1.2863580 3.2838057 0.0000000
## 4 4 2.805512 2.668675 3.927711 -0.7370022 -0.2504224 0.9397590
## Price
## 1 13.54179
## 2 18.93683
## 3 11.39620
## 4 26.45688
fviz_cluster(datos.km,
             data = hp_scale,
             repel = T, geom = "point") +
  theme_minimal() +
  labs(title = "K-Means Cluster Plot")

General Observations:
The four clusters are well-separated along Dim1 (52.2%) and Dim2 (14.3%), which collectively explain 66.5% of the variance. This suggests that the principal components effectively differentiate the data into meaningful groups, specifically PC1, as the groups are separated mainly by horizontal position.
Cluster-Specific Insights:
Cluster 1 (Red, Circle):
Positioned towards higher Dim1 values. Likely represents properties with higher quality or luxury, characterized by more bathrooms and higher price. May include larger properties but with fewer bedrooms (moderate spatial features).
Cluster 2 (Green, Triangle):
Spread along Dim2 with moderate Dim1 values. Likely represents properties with prominent spatial characteristics, such as larger Living Areas or more Bedrooms, but moderate cost and quality. These may be family-oriented properties.
Cluster 3 (Blue, Square):
Dominates the lower-right quadrant with lower Dim2 and moderate-to-high Dim1. Likely represents properties with moderate size and cost but without extreme spatial characteristics (e.g., fewer bedrooms or smaller lot sizes). May correspond to average or balanced properties.
Cluster 4 (Purple, Plus):
Positioned towards lower values on both Dim1 and Dim2. Likely represents smaller, lower-cost properties with minimal features, such as fewer bathrooms, smaller living areas, and potentially older age. Could represent budget or starter homes.
We can undergo a similar process of identifying the proper number of centroids for this partitioning method of clustering, then move forward with our analysis from there.
We will approach the rationalization for the number of centroids using a different method, shown below:
#install.packages("cluster")
fviz_nbclust(hp_scale, cluster::pam, method = "silhouette",
k.max = 5, nstart = 25)Our program has identified the proper number of clusters for this partitioning algorithm to be 2. Though this differs from the 4 clusters we based our K-means algorithm on, we can still gain insight of how our data is clustered amongst these two groups.
pam.res = pam(houseprice, 2)
fviz_cluster(pam.res,
             ellipse.type = "t",
             repel = TRUE,
             geom = "point"
) +
  theme_minimal() +
  labs(title = "K-Medoids Cluster Plot")

Cluster Characteristics:
Cluster 1 (Red): Denser grouping along positive values of Dim1. Likely represents properties with higher prices, larger living areas, and more luxurious features.
Cluster 2 (Blue): More dispersed and spread out along both Dim1 and Dim2. Represents properties with a wider range of spatial characteristics (e.g., different numbers of bedrooms) but generally lower values in size and price compared to Cluster 1.
Cluster Overlap: The overlap between clusters suggests shared features or transitional properties that do not fit neatly into one cluster, such as mid-sized or mid-priced properties.
K-Means (4 Clusters):
Provides richer insights by splitting data into smaller, distinct groups, each representing unique property profiles (e.g., luxury properties, family-sized homes, budget-friendly homes, or niche categories).
Overlapping areas suggest shared characteristics between groups (e.g., transitional properties that could appeal to multiple market segments).
Enables detailed segmentation for targeted real estate strategies.
K-Medoids (2 Clusters):
Offers a broader perspective, focusing on two primary groups with substantial overlap. This clustering is more robust to outliers and works well for high-level segmentation but misses detailed nuances.
Useful for high-level strategic decisions. A rough quantitative comparison of the two partitions is sketched below.
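As a hedged way to put a number on this trade-off, the average silhouette width of each partition can be compared on the scaled data; a sketch (note that `pam.res` was fit on the untransformed `houseprice` data, so this is only an approximate comparison):

d_eucl  <- dist(hp_scale)
sil_km  <- silhouette(datos.km$cluster, d_eucl)     # K-means, 4 clusters
sil_pam <- silhouette(pam.res$clustering, d_eucl)   # K-medoids, 2 clusters
mean(sil_km[, "sil_width"])
mean(sil_pam[, "sil_width"])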
We will take a step away from partitioning algorithms to draw analysis from a new type of clustering: hierarchical clustering.
Hierarchical clustering is a method of grouping data points into clusters based on their similarity, using a tree-like structure called a dendrogram. It starts by treating each data point as its own cluster and progressively merges the closest clusters based on a distance metric and a linkage method (e.g., complete, single, or average). The result allows you to visualize clusters at various levels of granularity and decide how many clusters best represent the data.
Our first step is to calculate the pairwise Euclidean distances between all observations in the dataset. This distance matrix quantifies how similar or dissimilar each pair of data points is, serving as the input for hierarchical clustering. It ensures the clustering algorithm can group observations based on their similarities by merging closer points into clusters iteratively.
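The chunk that builds this distance matrix (and the heatmap discussed next) is not echoed in the report; a minimal sketch using `dist()` and factoextra's `fviz_dist()`:

datos.dist.eucl <- dist(hp_scale, method = "euclidean")
fviz_dist(datos.dist.eucl, show_labels = FALSE)   # heatmap of pairwise distances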
Above we have visualized the pairwise Euclidean distance matrix as a heatmap. Each cell in the heatmap represents the distance between two data points, with darker or lighter shades indicating smaller or larger distances, respectively. This helps identify patterns, such as clusters of points that are closer together (low distances), before applying clustering methods, and it provides visual intuition of how similar or dissimilar the data points are. Given the size of our dataset, it is hard to draw any initial insights from the heatmap, so we will simply continue with our analysis.
As mentioned above, there are several different linkage methods for structuring the tree. Before graphing, we will test the validity of each method with the cophenetic correlation coefficient (CCC), which measures how well the hierarchical clustering results (dendrogram) preserve the original pairwise distances between observations.
# Ward.D2 Method
datos.hc_ward = hclust(d = datos.dist.eucl, method = "ward.D2")
res.coph = cophenetic(datos.hc_ward)
cor(datos.dist.eucl, res.coph)

## [1] 0.7046158

# Average Method
datos.hc_avg = hclust(d = datos.dist.eucl, method = "average")
res.coph = cophenetic(datos.hc_avg)
cor(datos.dist.eucl, res.coph)

## [1] 0.7760909

# Complete Method
datos.hc_com = hclust(d = datos.dist.eucl, method = "complete")
res.coph = cophenetic(datos.hc_com)
cor(datos.dist.eucl, res.coph)

## [1] 0.6355727
Based on these outcomes, the two linkage methods of interest are the Average and Ward.D2 methods, as they have the highest CCC scores.
Before deciding which method to move forward with, we will do some light clustering and graphing of the two methods to get an initial sense of which we would like to dissect.
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
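The objects `c1`, `grp_ward`, and `grp_avg` used below come from a chunk that is not echoed; a minimal sketch of how they could be defined, assuming a three-cluster cut of each tree:

c1 <- fviz_dend(datos.hc_ward, cex = 0.5) +
  labs(title = "Ward D2 Method")
grp_ward <- cutree(datos.hc_ward, k = 3)   # assumed cut into 3 groups
grp_avg  <- cutree(datos.hc_avg, k = 3)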
c2 = fviz_dend(datos.hc_avg, cex = 0.5) +
  labs(title = "Average Method")
grid.arrange(c1, c2, ncol = 2)
table(grp_ward)

## grp_ward
##   1   2   3
## 400 419 228

table(grp_avg)

## grp_avg
##   1   2   3
## 704 310  33
Due to the extreme imbalance in the cluster groupings that comes with the average method, we will continue our analysis with the Ward.D2 method, as it is better for creating well-defined, balanced clusters that are more interpretable and homogeneous.
This dendrogram indicates that 3 clusters are appropriate because the largest vertical distances in the tree, which represent significant dissimilarities between clusters, occur at a height of approximately 150. Cutting the dendrogram at this height results in three distinct groups that capture meaningful differences in the data. Below this height, branches become tightly packed, and additional splits would lead to smaller clusters that may overfit the data or merge points with minimal differences, reducing interpretability. Thus, 3 clusters balance distinct group separation and practical analysis.
fviz_dend(datos.hc_ward,
          cex = 0.5,
          k = 3,
          k_colors = c("#2E9FDF", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE,
          rect = TRUE
)

Now that we have separated our dendrogram into its clusters, we can graph these clusters in a similar fashion to our K-Means and K-Medoids methods to draw some insight into what the clusters could represent and their significance.
set.seed(123)
fviz_cluster(list(data = houseprice, cluster = grp_ward),
             palette = c("#2E9FDF", "#E7B800", "#FC4E07"),
             ellipse.type = "convex",
             repel = TRUE,
             show.clust.cent = FALSE,
             geom = "point") +
  theme_minimal() +
  labs(title = "Ward D2 Method Hierarchical Clustering")

Cluster 1 (Blue):
Positioned in the lower-left quadrant with lower values on Dim1 and Dim2. Likely represents smaller or budget-friendly properties with minimal features.
Target budget-conscious buyers or those looking for starter homes.
Cluster 2 (Yellow):
Centered in the upper-right quadrant, with higher values on both Dim1 and Dim2. Represents properties with larger sizes, more features (e.g., more bedrooms or living space), and potentially higher prices.
Focus marketing on families or high-income buyers seeking larger and more luxurious properties.
Cluster 3 (Red):
Spreads across moderate values of Dim1, overlapping with both other clusters. Likely represents transitional or mid-range properties that balance features and cost.
Cater to mid-market buyers who prioritize affordability without sacrificing features.
1 Choose the Method Based on Business Goals:
High-Level Insights: Use K-Medoids for simpler, robust segmentation (e.g., premium vs. budget categories).
Detailed Segmentation: Use K-Means for finer distinctions when smaller subgroups matter (e.g., niche marketing campaigns).
Balanced Clusters: Use Ward D2 for balanced, well-separated groups suitable for strategic decision-making.
2 Actionable Strategies (Based on K-means):
Cluster 1 (small size, low cost): Target budget-conscious buyers.
Cluster 2 (moderate size and cost): Highlight balance and affordability.
Cluster 3/4 (large size, high cost): Market to premium or family-oriented buyers.