The rental housing market is a dynamic and complex system influenced by various factors, including location, property size, amenities, and market conditions. Understanding how these factors interact can provide valuable insights for both property owners and renters.
This project aims to analyze apartment rental listings using unsupervised machine learning techniques, including dimensionality reduction (t-SNE & UMAP) and clustering (DBSCAN). The goal is to uncover hidden patterns in the data, identify potential rental market segments, and analyze how key features such as price, square footage, and geographic location influence rental prices.
Dataset: https://archive.ics.uci.edu/dataset/555/apartment+for+rent+classified
library(ggplot2)
library(Rtsne)
library(umap)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load the dataset
data <- read.csv("C:/Users/admin/Desktop/unsupervied-learning/project/dimention reduction/apartments_for_rent_classified_10K.csv", sep=";")
# Select numeric features for dimensionality reduction
features <- c("price", "square_feet", "bathrooms", "bedrooms", "latitude", "longitude")
df <- data %>%
mutate(
bathrooms = as.numeric(ifelse(bathrooms == "null", NA, bathrooms)),
bedrooms = as.numeric(ifelse(bedrooms == "null", NA, bedrooms)),
latitude = as.numeric(latitude),
longitude = as.numeric(longitude),
square_feet = as.numeric(square_feet), # Ensure 'square_feet' is numeric
price = as.numeric(price) # Ensure 'price' is numeric
) %>%
select(all_of(features)) %>% # Select only relevant columns
na.omit() # Remove rows with missing values
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `latitude = as.numeric(latitude)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
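It can be worth quantifying how much data the coercion and na.omit() step discards; a quick check (a small sketch, not part of the original pipeline):
# How many of the raw rows were dropped by coercion/NA filtering?
cat("Rows removed:", nrow(data) - nrow(df), "\n")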
# Standardize numeric features
df_scaled <- scale(df)
# Remove duplicate rows
df_scaled_unique <- unique(df_scaled)
# Display structure of transformed data
str(df_scaled_unique)
## num [1:9744, 1:6] -0.09058 -0.52219 0.91652 0.00688 0.19252 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:9744] "3" "4" "6" "9" ...
## ..$ : chr [1:6] "price" "square_feet" "bathrooms" "bedrooms" ...
# Print the number of duplicates removed
cat("Number of duplicate rows removed:", nrow(df) - nrow(df_scaled_unique), "\n")
## Number of duplicate rows removed: 206
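As a sanity check, the scaled matrix should have zero-mean, unit-variance columns; a quick sketch to verify:
# Column means should be ~0 and standard deviations ~1 after scale()
round(colMeans(df_scaled), 3)
round(apply(df_scaled, 2, sd), 3)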
t-SNE (t-distributed Stochastic Neighbor Embedding) is an unsupervised, non-linear dimensionality reduction technique for exploring and visualizing high-dimensional data. Non-linear dimensionality reduction means the algorithm can separate data that cannot be separated by a straight line.
The t-SNE algorithm defines a similarity measure between pairs of instances in both the high- and low-dimensional spaces, and then optimizes the low-dimensional embedding so that the two sets of similarities agree. It does this in three steps.
1. t-SNE models the probability of one point being selected as a neighbor of another point, in both the high- and low-dimensional spaces. It starts by computing pairwise similarities between all data points in the high-dimensional space using a Gaussian kernel; points far apart have a lower probability of being picked than points close together.
2. The algorithm then tries to map the high-dimensional data points onto the low-dimensional space while preserving these pairwise similarities.
3. This is achieved by minimizing the divergence between the high-dimensional and low-dimensional probability distributions. The algorithm uses gradient descent to minimize the divergence until the low-dimensional embedding settles into a stable state.
The optimization process allows the creation of clusters and sub-clusters of similar data points in the lower-dimensional space, which are visualized to understand the structure and relationships in the higher-dimensional data.
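Concretely, the divergence minimized in step 3 is the Kullback-Leibler divergence between the high-dimensional similarities $P$ and the low-dimensional similarities $Q$:

$$C = \mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

where the $p_{ij}$ come from Gaussian kernels in the original space and the $q_{ij}$ from a heavier-tailed Student-t kernel in the embedding, $q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}$.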
set.seed(42)
# Test different perplexity values
tsne_result_10 <- Rtsne(df_scaled_unique, dims=2, perplexity=10, verbose=TRUE, max_iter=500)
## Performing PCA
## Read the 9744 x 6 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 10.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 1.76 seconds (sparsity = 0.004080)!
## Learning embedding...
## Iteration 50: error is 110.886818 (50 iterations in 2.46 seconds)
## Iteration 100: error is 92.004418 (50 iterations in 2.27 seconds)
## Iteration 150: error is 84.016482 (50 iterations in 2.22 seconds)
## Iteration 200: error is 80.261447 (50 iterations in 1.90 seconds)
## Iteration 250: error is 77.939601 (50 iterations in 2.34 seconds)
## Iteration 300: error is 3.274628 (50 iterations in 1.92 seconds)
## Iteration 350: error is 2.763895 (50 iterations in 1.69 seconds)
## Iteration 400: error is 2.401364 (50 iterations in 1.66 seconds)
## Iteration 450: error is 2.136538 (50 iterations in 1.63 seconds)
## Iteration 500: error is 1.933873 (50 iterations in 1.89 seconds)
## Fitting performed in 19.99 seconds.
tsne_result_50 <- Rtsne(df_scaled_unique, dims=2, perplexity=50, verbose=TRUE, max_iter=500)
## Performing PCA
## Read the 9744 x 6 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 50.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 4.98 seconds (sparsity = 0.020599)!
## Learning embedding...
## Iteration 50: error is 91.645231 (50 iterations in 2.61 seconds)
## Iteration 100: error is 75.009698 (50 iterations in 3.34 seconds)
## Iteration 150: error is 69.436272 (50 iterations in 2.27 seconds)
## Iteration 200: error is 67.560854 (50 iterations in 2.25 seconds)
## Iteration 250: error is 66.643510 (50 iterations in 2.33 seconds)
## Iteration 300: error is 2.110687 (50 iterations in 2.24 seconds)
## Iteration 350: error is 1.691964 (50 iterations in 2.27 seconds)
## Iteration 400: error is 1.442128 (50 iterations in 2.27 seconds)
## Iteration 450: error is 1.277740 (50 iterations in 2.29 seconds)
## Iteration 500: error is 1.161792 (50 iterations in 2.32 seconds)
## Fitting performed in 24.19 seconds.
tsne_result_100 <- Rtsne(df_scaled_unique, dims=2, perplexity=100, verbose=TRUE, max_iter=500)
## Performing PCA
## Read the 9744 x 6 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 100.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 10.08 seconds (sparsity = 0.041529)!
## Learning embedding...
## Iteration 50: error is 83.311580 (50 iterations in 3.57 seconds)
## Iteration 100: error is 69.095712 (50 iterations in 3.12 seconds)
## Iteration 150: error is 64.280689 (50 iterations in 2.89 seconds)
## Iteration 200: error is 63.073885 (50 iterations in 3.02 seconds)
## Iteration 250: error is 62.503352 (50 iterations in 2.96 seconds)
## Iteration 300: error is 1.653361 (50 iterations in 2.89 seconds)
## Iteration 350: error is 1.304821 (50 iterations in 2.77 seconds)
## Iteration 400: error is 1.114736 (50 iterations in 2.63 seconds)
## Iteration 450: error is 0.994993 (50 iterations in 2.57 seconds)
## Iteration 500: error is 0.914776 (50 iterations in 2.59 seconds)
## Fitting performed in 29.01 seconds.
# Create data frames
tsne_df_10 <- data.frame(tsne_result_10$Y)
colnames(tsne_df_10) <- c("Dim1", "Dim2")
tsne_df_10$Perplexity <- "10"
tsne_df_50 <- data.frame(tsne_result_50$Y)
colnames(tsne_df_50) <- c("Dim1", "Dim2")
tsne_df_50$Perplexity <- "50"
tsne_df_100 <- data.frame(tsne_result_100$Y)
colnames(tsne_df_100) <- c("Dim1", "Dim2")
tsne_df_100$Perplexity <- "100"
# Combine all results
tsne_df_all <- rbind(tsne_df_10, tsne_df_50, tsne_df_100)
# Visualize t-SNE results with different perplexity values
ggplot(tsne_df_all, aes(x=Dim1, y=Dim2, color=Perplexity)) +
geom_point(alpha=0.5) +
facet_wrap(~Perplexity) + # Show results for different perplexity values
theme_minimal() +
ggtitle("Comparison of t-SNE Results with Different Perplexity Values")
Perplexity = 10 (left, red): the points are densely packed into many small clusters, but the overall shape is relatively uniform; too much focus on local structure may leave the global structure unclear.
Perplexity = 50 (right, blue): a clear local clustering structure forms. This setting works well for medium-sized data, and some potential groupings are visible.
Perplexity = 100 (middle, green): larger clumps form, but the global structure begins to blur.
UMAP (Uniform Manifold Approximation and Projection) is a popular dimensionality reduction technique that is particularly well-suited for visualizing high-dimensional datasets in two or three dimensions. Developed by Leland McInnes, John Healy, and James Melville in 2018, UMAP is rooted in concepts from Riemannian geometry and algebraic topology—specifically, it builds a graph (or fuzzy simplicial complex) representing the high-dimensional manifold and then optimizes a low-dimensional embedding that preserves this structure as faithfully as possible.
Key ideas of UMAP:
1. Manifold assumption: assumes that data in high-dimensional space lie on (or close to) a manifold of lower dimension.
2. Fuzzy graph representation: constructs a weighted k-nearest-neighbor graph, capturing local relationships in the data.
3. Optimization: finds an embedding in the lower-dimensional space (often 2D or 3D) such that the distances between points in this space reflect the relationships in the original high-dimensional space as well as possible.
set.seed(42)
umap_result <- umap(df_scaled_unique)
# Create a data frame
umap_df <- data.frame(umap_result$layout)
colnames(umap_df) <- c("Dim1", "Dim2")
# Visualize UMAP results
ggplot(umap_df, aes(x=Dim1, y=Dim2)) +
geom_point(alpha=0.5, color="purple") +
ggtitle("UMAP Visualization of Rental Listings") +
theme_minimal()
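The call above uses the package defaults (n_neighbors = 15, min_dist = 0.1). These two hyperparameters have the largest effect on the embedding, and could be tuned via umap.defaults; a sketch with illustrative (not tuned) values:
# Copy the default configuration and adjust the key hyperparameters
custom_config <- umap.defaults
custom_config$n_neighbors <- 30 # larger values emphasize global structure
custom_config$min_dist <- 0.05 # smaller values produce tighter clusters
umap_result_tuned <- umap(df_scaled_unique, config = custom_config)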
The UMAP embedding is more compact than the t-SNE embeddings, forming several distinct clusters, which indicates shared features among groups of rental listings. However, some data points remain scattered, which may mean that:
Some properties are only weakly similar on the selected features (price, square footage, latitude/longitude, etc.), or there are outliers that distort the overall clustering structure.
library(dbscan)
##
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
##
## as.dendrogram
# Run DBSCAN clustering
set.seed(42)
dbscan_result <- dbscan(umap_df, eps=1, minPts=5)
# Add cluster labels to data
umap_df$cluster <- as.factor(dbscan_result$cluster)
# Visualize DBSCAN results
ggplot(umap_df, aes(x=Dim1, y=Dim2, color=cluster)) +
geom_point(alpha=0.5) +
theme_minimal() +
ggtitle("DBSCAN Clustering on UMAP Results")
# Summary of clusters
table(umap_df$cluster)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 95 23 56 385 1128 522 160 641 170 1039 307 44 294 534 67 2183
## 17 18 19 20 21 22 23 24 25 26 27 28 29
## 291 32 145 105 28 237 784 30 16 49 127 235 17
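With these settings no points were labeled as noise (cluster 0 is absent from the table). The values eps = 1 and minPts = 5 were chosen by hand; a common heuristic for eps is the k-nearest-neighbor distance plot from the dbscan package, where one looks for the "elbow" in the sorted distances. A small sketch (the dashed line marks the eps used above):
# Plot sorted 5-NN distances; the elbow suggests a reasonable eps
kNNdistplot(umap_df[, c("Dim1", "Dim2")], k = 5)
abline(h = 1, lty = 2)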
# Align the original (unscaled) data with the de-duplicated scaled rows
df <- df[row.names(df_scaled_unique), ]
nrow(df) # Ensure it matches nrow(umap_df)
## [1] 9744
# Add clustering results
df$cluster <- umap_df$cluster
cluster_summary <- df %>%
group_by(cluster) %>%
summarise(
avg_price = mean(price, na.rm=TRUE),
avg_bedrooms = mean(bedrooms, na.rm=TRUE),
avg_bathrooms = mean(bathrooms, na.rm=TRUE),
avg_square_feet = mean(square_feet, na.rm=TRUE),
count = n()
)
# Display cluster statistics
print(cluster_summary)
## # A tibble: 29 × 6
## cluster avg_price avg_bedrooms avg_bathrooms avg_square_feet count
## <fct> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 1189. 0 1 470. 95
## 2 2 1163. 0 1 345 23
## 3 3 1744. 0 1.01 434. 56
## 4 4 1070. 1 1.00 720. 385
## 5 5 1068. 0.999 1.00 669. 1128
## 6 6 1939. 0.998 1.01 645. 522
## 7 7 1100. 1 1.01 621. 160
## 8 8 1595. 1 1 710. 641
## 9 9 1210. 1 1.00 593. 170
## 10 10 950. 0.998 1.00 680. 1039
## # ℹ 19 more rows
Price variation:
Average prices differ substantially across clusters, from about $2,164 in the most expensive cluster (cluster 16) down to about $787 in the cheapest (cluster 20). These clusters may reflect different rental market segments (high-end vs. affordable).
Distribution of size and number of bedrooms:
Cluster 16 shows larger units with more bedrooms, which may indicate high-end apartments or single-family homes. Clusters 1 and 2 average close to 0 bedrooms, so they likely consist mainly of studios or other small units.
Distribution of number of listings (count):
Some clusters (such as 5 and 16) contain a large number of listings, while others (such as 25 and 29) contain only a handful.
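These characterizations can be checked by inspecting a single segment directly; a small sketch for the most expensive cluster, using the df with cluster labels created above:
# Five-number summaries for cluster 16
df %>%
  filter(cluster == "16") %>%
  select(price, bedrooms, bathrooms, square_feet) %>%
  summary()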
# Visualize price distribution across clusters
ggplot(df, aes(x=as.factor(cluster), y=price, fill=as.factor(cluster))) +
geom_boxplot() +
theme_minimal() +
ggtitle("Price Distribution Across Clusters") +
xlab("Cluster") +
ylab("Price")
Most clusters have prices between $500 and $5,000.
Cluster 16 has the highest prices: its top listings exceed $50,000, which may correspond to high-end luxury apartments or houses.
Prices in the other clusters are relatively stable: most clusters show little spread, indicating similar price structures among those properties.
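Given the extreme right skew, a log-scaled version of the boxplot can be easier to read; a small variation on the plot above:
# Same boxplot with a log10 price axis to compress the long upper tail
ggplot(df, aes(x=as.factor(cluster), y=price, fill=as.factor(cluster))) +
  geom_boxplot() +
  scale_y_log10() +
  theme_minimal() +
  ggtitle("Price Distribution Across Clusters (log scale)") +
  xlab("Cluster") +
  ylab("Price (log10)")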
# Visualize square feet vs. number of bedrooms
ggplot(df, aes(x=square_feet, y=bedrooms, color=as.factor(cluster))) +
geom_point(alpha=0.5) +
theme_minimal() +
ggtitle("Cluster Analysis: Bedrooms vs Square Feet")
Bedrooms are mostly concentrated between 0 and 3:
Most listings have at most 3 bedrooms and fall between roughly 1,000 and 3,000 square feet. This suggests the rental market is dominated by small apartments and houses.
Some listings are particularly large:
A few listings span 6,000-9,000 square feet with 6+ bedrooms, likely high-end homes or luxury apartments. It is worth checking which clusters they belong to and whether they are associated with high prices (see the sketch after this list).
Cluster distribution:
Listings with low bedroom counts (0-1) mostly sit in the lower price range, while larger listings (3+ bedrooms) may belong to different high-end clusters (such as 16 and 17).
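A quick way to see where those very large listings fall (a sketch, again using the merged df):
# Which clusters contain the very large listings?
df %>%
  filter(square_feet > 6000) %>%
  count(cluster, sort = TRUE)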
# Geographic visualization of clusters
ggplot(df, aes(x=longitude, y=latitude, color=as.factor(cluster))) +
geom_point(alpha=0.5) +
theme_minimal() +
ggtitle("Cluster Distribution on Geographic Data")
The differently colored clusters show some regionality:
Some clusters are distributed along the East Coast (such as New York, Washington, and Boston), others are concentrated on the West Coast (such as Los Angeles, San Francisco, and Seattle), and some are scattered across the central and southern regions.
Some areas have almost no data:
Several western states (such as Montana and Wyoming) have almost no listings, probably because the dataset is concentrated in areas with active rental markets.
# Correlation between price and square feet
correlation <- cor(df$price, df$square_feet, use="complete.obs")
print(paste("Correlation between Price and Square Feet: ", round(correlation, 3)))
## [1] "Correlation between Price and Square Feet: 0.465"
# Visualize price vs. square feet by cluster
ggplot(df, aes(x=square_feet, y=price, color=as.factor(cluster))) +
geom_point(alpha=0.5) +
theme_minimal() +
ggtitle("Price vs Square Feet by Cluster")
# Log-linear regression analysis
lm_log <- lm(log(price) ~ square_feet + latitude + longitude + bedrooms + bathrooms, data = df)
summary(lm_log)
##
## Call:
## lm(formula = log(price) ~ square_feet + latitude + longitude +
## bedrooms + bathrooms, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1038 -0.2609 -0.0421 0.2353 3.4419
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.020e+00 3.635e-02 165.598 < 2e-16 ***
## square_feet 3.233e-04 1.376e-05 23.489 < 2e-16 ***
## latitude 3.309e-03 7.346e-04 4.504 6.73e-06 ***
## longitude -5.872e-03 2.557e-04 -22.966 < 2e-16 ***
## bedrooms -1.526e-02 6.595e-03 -2.314 0.0207 *
## bathrooms 1.459e-01 1.130e-02 12.909 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3951 on 9738 degrees of freedom
## Multiple R-squared: 0.2946, Adjusted R-squared: 0.2942
## F-statistic: 813.2 on 5 and 9738 DF, p-value: < 2.2e-16
Square footage still drives price, but non-linearly. Each extra square foot is associated with a price increase of about 0.032%, so the percentage effect per square foot is small; larger homes do get more expensive, but at a diminishing rate per square foot.
Latitude is now significant. Unlike previous models where latitude had no effect, it is statistically significant here (p = 6.73e-06), suggesting higher latitudes (e.g., NYC, Chicago) have higher prices compared to southern areas.
Longitude is highly significant. Price drops by about 0.59% for each one-degree increase in longitude. Since U.S. longitudes are negative and increase moving east, the negative coefficient implies that, all else equal, listings further west (e.g., the West Coast) tend to be more expensive.
Bedrooms still show a negative impact. Each extra bedroom is associated with a 1.53% lower price, which is counterintuitive. Possible explanations: small high-end apartments (studios) are expensive, while larger homes with more bedrooms may have a lower price per square foot; and multicollinearity with square footage, since more bedrooms usually mean bigger houses, so the marginal effect of a bedroom may not be linear.
Bathrooms have the strongest impact. Each extra bathroom is associated with roughly 14.6% higher price (the coefficient is 0.146 log points; exp(0.146) − 1 ≈ 15.7% exactly), indicating that bathroom count is a key factor in luxury properties such as high-end apartments and villas.
Model fit (R² = 29.46%). The model explains 29.46% of price variation, an improvement over the linear model (~25%). Still, about 70% of the variation remains unexplained, suggesting location, property type, and market conditions play a bigger role.
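Because the response is log-transformed, the coefficients above are only approximate percentage effects; the exact effects follow from exponentiating them. A small sketch (the commented car line is optional and assumes that package is installed):
# Convert log-linear coefficients to exact percentage effects
percent_effects <- (exp(coef(lm_log)) - 1) * 100
round(percent_effects, 2)
# Optionally check multicollinearity among predictors:
# car::vif(lm_log)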
The clustering results revealed distinct rental price segments, including low-cost small apartments, mid-range rentals, and high-end luxury properties. Cluster 16 emerged as the most expensive segment, likely representing luxury apartments or high-end homes. Longitude was a highly significant predictor of price; given its negative coefficient, prices tend to rise moving westward, all else equal. Latitude became significant in the log-transformed model, indicating that higher latitudes (e.g., NYC, Chicago) tend to have higher rental prices. Bathrooms had the strongest positive impact on price, with each additional bathroom associated with roughly 15% higher rent. Bedrooms showed a counterintuitive negative impact, possibly due to high-end studio apartments being more expensive per square foot.
Our final regression model explained 29.46% of price variation,
indicating that while square footage, location, and amenities play a
role, external factors like demand, neighborhood, and property type also
heavily influence rental prices.