The rental housing market is a dynamic and complex system influenced by various factors, including location, property size, amenities, and market conditions. Understanding how these factors interact can provide valuable insights for both property owners and renters.
This project aims to analyze apartment rental listings using unsupervised machine learning techniques, including dimensionality reduction (t-SNE & UMAP) and clustering (DBSCAN). The goal is to uncover hidden patterns in the data, identify potential rental market segments, and analyze how key features such as price, square footage, and geographic location influence rental prices.
Dataset: https://archive.ics.uci.edu/dataset/555/apartment+for+rent+classified
library(ggplot2)
library(Rtsne)
library(umap)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load the dataset
data <- read.csv("C:/Users/admin/Desktop/unsupervied-learning/project/dimention reduction/apartments_for_rent_classified_10K.csv", sep=";")
# Select numeric features for dimensionality reduction
features <- c("price", "square_feet", "bathrooms", "bedrooms", "latitude", "longitude")
df <- data %>%
mutate(
bathrooms = as.numeric(ifelse(bathrooms == "null", NA, bathrooms)),
bedrooms = as.numeric(ifelse(bedrooms == "null", NA, bedrooms)),
latitude = as.numeric(latitude),
longitude = as.numeric(longitude),
square_feet = as.numeric(square_feet), # Ensure 'square_feet' is numeric
price = as.numeric(price) # Ensure 'price' is numeric
) %>%
select(all_of(features)) %>% # Select only relevant columns
na.omit() # Remove rows with missing values
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `latitude = as.numeric(latitude)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
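It can be worth quantifying how much data the coercion and na.omit() step discards; a quick check (a small sketch, not part of the original pipeline):
# How many of the raw rows were dropped by coercion/NA filtering?
cat("Rows removed:", nrow(data) - nrow(df), "\n")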
# Standardize numeric features
df_scaled <- scale(df)
# Remove duplicate rows
df_scaled_unique <- unique(df_scaled)
# Display structure of transformed data
str(df_scaled_unique)
## num [1:9744, 1:6] -0.09058 -0.52219 0.91652 0.00688 0.19252 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:9744] "3" "4" "6" "9" ...
## ..$ : chr [1:6] "price" "square_feet" "bathrooms" "bedrooms" ...
# Print the number of duplicates removed
cat("Number of duplicate rows removed:", nrow(df) - nrow(df_scaled_unique), "\n")
## Number of duplicate rows removed: 206
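As a sanity check, the scaled matrix should have zero-mean, unit-variance columns; a quick sketch to verify:
# Column means should be ~0 and standard deviations ~1 after scale()
round(colMeans(df_scaled), 3)
round(apply(df_scaled, 2, sd), 3)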
t-SNE (t-distributed Stochastic Neighbor Embedding) is an unsupervised, non-linear dimensionality reduction technique for exploring and visualizing high-dimensional data. Non-linear dimensionality reduction means the algorithm can separate data that cannot be separated by a straight line.
The t-SNE algorithm defines a similarity measure between pairs of instances in both the high- and low-dimensional spaces, and then optimizes the low-dimensional embedding so that the two sets of similarities agree. It does this in three steps.
1. t-SNE models the probability of one point being selected as a neighbor of another point, in both the high- and low-dimensional spaces. It starts by computing pairwise similarities between all data points in the high-dimensional space using a Gaussian kernel; points far apart have a lower probability of being picked than points close together.
2. The algorithm then tries to map the high-dimensional data points onto the low-dimensional space while preserving these pairwise similarities.
3. This is achieved by minimizing the divergence between the high-dimensional and low-dimensional probability distributions. The algorithm uses gradient descent to minimize the divergence until the low-dimensional embedding settles into a stable state.
The optimization process allows the creation of clusters and sub-clusters of similar data points in the lower-dimensional space, which are visualized to understand the structure and relationships in the higher-dimensional data.
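Concretely, the divergence minimized in step 3 is the Kullback-Leibler divergence between the high-dimensional similarities $P$ and the low-dimensional similarities $Q$:

$$C = \mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

where the $p_{ij}$ come from Gaussian kernels in the original space and the $q_{ij}$ from a heavier-tailed Student-t kernel in the embedding, $q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}$.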
set.seed(42)
# Test different perplexity values
tsne_result_10 <- Rtsne(df_scaled_unique, dims=2, perplexity=10, verbose=TRUE, max_iter=500)
## Performing PCA
## Read the 9744 x 6 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 10.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 1.76 seconds (sparsity = 0.004080)!
## Learning embedding...
## Iteration 50: error is 110.886818 (50 iterations in 2.46 seconds)
## Iteration 100: error is 92.004418 (50 iterations in 2.27 seconds)
## Iteration 150: error is 84.016482 (50 iterations in 2.22 seconds)
## Iteration 200: error is 80.261447 (50 iterations in 1.90 seconds)
## Iteration 250: error is 77.939601 (50 iterations in 2.34 seconds)
## Iteration 300: error is 3.274628 (50 iterations in 1.92 seconds)
## Iteration 350: error is 2.763895 (50 iterations in 1.69 seconds)
## Iteration 400: error is 2.401364 (50 iterations in 1.66 seconds)
## Iteration 450: error is 2.136538 (50 iterations in 1.63 seconds)
## Iteration 500: error is 1.933873 (50 iterations in 1.89 seconds)
## Fitting performed in 19.99 seconds.
tsne_result_50 <- Rtsne(df_scaled_unique, dims=2, perplexity=50, verbose=TRUE, max_iter=500)
## Performing PCA
## Read the 9744 x 6 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 50.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 4.98 seconds (sparsity = 0.020599)!
## Learning embedding...
## Iteration 50: error is 91.645231 (50 iterations in 2.61 seconds)
## Iteration 100: error is 75.009698 (50 iterations in 3.34 seconds)
## Iteration 150: error is 69.436272 (50 iterations in 2.27 seconds)
## Iteration 200: error is 67.560854 (50 iterations in 2.25 seconds)
## Iteration 250: error is 66.643510 (50 iterations in 2.33 seconds)
## Iteration 300: error is 2.110687 (50 iterations in 2.24 seconds)
## Iteration 350: error is 1.691964 (50 iterations in 2.27 seconds)
## Iteration 400: error is 1.442128 (50 iterations in 2.27 seconds)
## Iteration 450: error is 1.277740 (50 iterations in 2.29 seconds)
## Iteration 500: error is 1.161792 (50 iterations in 2.32 seconds)
## Fitting performed in 24.19 seconds.
tsne_result_100 <- Rtsne(df_scaled_unique, dims=2, perplexity=100, verbose=TRUE, max_iter=500)
## Performing PCA
## Read the 9744 x 6 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 100.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 10.08 seconds (sparsity = 0.041529)!
## Learning embedding...
## Iteration 50: error is 83.311580 (50 iterations in 3.57 seconds)
## Iteration 100: error is 69.095712 (50 iterations in 3.12 seconds)
## Iteration 150: error is 64.280689 (50 iterations in 2.89 seconds)
## Iteration 200: error is 63.073885 (50 iterations in 3.02 seconds)
## Iteration 250: error is 62.503352 (50 iterations in 2.96 seconds)
## Iteration 300: error is 1.653361 (50 iterations in 2.89 seconds)
## Iteration 350: error is 1.304821 (50 iterations in 2.77 seconds)
## Iteration 400: error is 1.114736 (50 iterations in 2.63 seconds)
## Iteration 450: error is 0.994993 (50 iterations in 2.57 seconds)
## Iteration 500: error is 0.914776 (50 iterations in 2.59 seconds)
## Fitting performed in 29.01 seconds.
# Create data frames
tsne_df_10 <- data.frame(tsne_result_10$Y)
colnames(tsne_df_10) <- c("Dim1", "Dim2")
tsne_df_10$Perplexity <- "10"
tsne_df_50 <- data.frame(tsne_result_50$Y)
colnames(tsne_df_50) <- c("Dim1", "Dim2")
tsne_df_50$Perplexity <- "50"
tsne_df_100 <- data.frame(tsne_result_100$Y)
colnames(tsne_df_100) <- c("Dim1", "Dim2")
tsne_df_100$Perplexity <- "100"
# Combine all results
tsne_df_all <- rbind(tsne_df_10, tsne_df_50, tsne_df_100)
# Visualize t-SNE results with different perplexity values
ggplot(tsne_df_all, aes(x=Dim1, y=Dim2, color=Perplexity)) +
geom_point(alpha=0.5) +
facet_wrap(~Perplexity) + # Show results for different perplexity values
theme_minimal() +
ggtitle("Comparison of t-SNE Results with Different Perplexity Values")
Perplexity = 10 (left, red): the points are densely packed into many small clusters, but the overall shape is relatively uniform; too much focus on local structure may leave the global structure unclear.
Perplexity = 50 (right, blue): a clear local clustering structure forms. This setting works well for medium-sized data, and some potential groupings are visible.
Perplexity = 100 (middle, green): larger clumps form, but the global structure begins to blur.
UMAP (Uniform Manifold Approximation and Projection) is a popular dimensionality reduction technique that is particularly well-suited for visualizing high-dimensional datasets in two or three dimensions. Developed by Leland McInnes, John Healy, and James Melville in 2018, UMAP is rooted in concepts from Riemannian geometry and algebraic topology—specifically, it builds a graph (or fuzzy simplicial complex) representing the high-dimensional manifold and then optimizes a low-dimensional embedding that preserves this structure as faithfully as possible.
Key ideas of UMAP:
1. Manifold assumption: assumes that data in high-dimensional space lie on (or close to) a manifold of lower dimension.
2. Fuzzy graph representation: constructs a weighted k-nearest-neighbor graph, capturing local relationships in the data.
3. Optimization: finds an embedding in the lower-dimensional space (often 2D or 3D) such that the distances between points in this space reflect the relationships in the original high-dimensional space as well as possible.
set.seed(42)
umap_result <- umap(df_scaled_unique)
# Create a data frame
umap_df <- data.frame(umap_result$layout)
colnames(umap_df) <- c("Dim1", "Dim2")
# Visualize UMAP results
ggplot(umap_df, aes(x=Dim1, y=Dim2)) +
geom_point(alpha=0.5, color="purple") +
ggtitle("UMAP Visualization of Rental Listings") +
theme_minimal()
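The call above uses the package defaults (n_neighbors = 15, min_dist = 0.1). These two hyperparameters have the largest effect on the embedding, and could be tuned via umap.defaults; a sketch with illustrative (not tuned) values:
# Copy the default configuration and adjust the key hyperparameters
custom_config <- umap.defaults
custom_config$n_neighbors <- 30 # larger values emphasize global structure
custom_config$min_dist <- 0.05 # smaller values produce tighter clusters
umap_result_tuned <- umap(df_scaled_unique, config = custom_config)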
The UMAP embedding is more compact than the t-SNE embeddings, forming several distinct clusters, which indicates shared features among groups of rental listings. However, some data points remain scattered, which may mean that:
Some properties are only weakly similar on the selected features (price, square footage, latitude/longitude, etc.), or there are outliers that distort the overall clustering structure.
library(dbscan)
##
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
##
## as.dendrogram
# Run DBSCAN clustering
set.seed(42)
dbscan_result <- dbscan(umap_df, eps=1, minPts=5)
# Add cluster labels to data
umap_df$cluster <- as.factor(dbscan_result$cluster)
# Visualize DBSCAN results
ggplot(umap_df, aes(x=Dim1, y=Dim2, color=cluster)) +
geom_point(alpha=0.5) +
theme_minimal() +
ggtitle("DBSCAN Clustering on UMAP Results")
# Summary of clusters
table(umap_df$cluster)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 95 23 56 385 1128 522 160 641 170 1039 307 44 294 534 67 2183
## 17 18 19 20 21 22 23 24 25 26 27 28 29
## 291 32 145 105 28 237 784 30 16 49 127 235 17
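With these settings no points were labeled as noise (cluster 0 is absent from the table). The values eps = 1 and minPts = 5 were chosen by hand; a common heuristic for eps is the k-nearest-neighbor distance plot from the dbscan package, where one looks for the "elbow" in the sorted distances. A small sketch (the dashed line marks the eps used above):
# Plot sorted 5-NN distances; the elbow suggests a reasonable eps
kNNdistplot(umap_df[, c("Dim1", "Dim2")], k = 5)
abline(h = 1, lty = 2)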
# Align the original (unscaled) data with the de-duplicated scaled rows
df <- df[row.names(df_scaled_unique), ]
nrow(df) # Ensure it matches nrow(umap_df)
## [1] 9744
# Add clustering results
df$cluster <- umap_df$cluster
cluster_summary <- df %>%
group_by(cluster) %>%
summarise(
avg_price = mean(price, na.rm=TRUE),
avg_bedrooms = mean(bedrooms, na.rm=TRUE),
avg_bathrooms = mean(bathrooms, na.rm=TRUE),
avg_square_feet = mean(square_feet, na.rm=TRUE),
count = n()
)
# Display cluster statistics
print(cluster_summary)
## # A tibble: 29 × 6
## cluster avg_price avg_bedrooms avg_bathrooms avg_square_feet count
## <fct> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 1189. 0 1 470. 95
## 2 2 1163. 0 1 345 23
## 3 3 1744. 0 1.01 434. 56
## 4 4 1070. 1 1.00 720. 385
## 5 5 1068. 0.999 1.00 669. 1128
## 6 6 1939. 0.998 1.01 645. 522
## 7 7 1100. 1 1.01 621. 160
## 8 8 1595. 1 1 710. 641
## 9 9 1210. 1 1.00 593. 170
## 10 10 950. 0.998 1.00 680. 1039
## # ℹ 19 more rows
Price variation:
Average prices differ substantially across clusters, from about $2,164 in the most expensive cluster (cluster 16) down to about $787 in the cheapest (cluster 20). These clusters may reflect different rental market segments (high-end vs. affordable).
Distribution of size and number of bedrooms:
Cluster 16 shows larger units with more bedrooms, which may indicate high-end apartments or single-family homes. Clusters 1 and 2 average close to 0 bedrooms, so they likely consist mainly of studios or other small units.
Distribution of number of listings (count):
Some clusters (such as 5 and 16) contain a large number of listings, while others (such as 25 and 29) contain only a handful.
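These characterizations can be checked by inspecting a single segment directly; a small sketch for the most expensive cluster, using the df with cluster labels created above:
# Five-number summaries for cluster 16
df %>%
  filter(cluster == "16") %>%
  select(price, bedrooms, bathrooms, square_feet) %>%
  summary()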
# Visualize price distribution across clusters
ggplot(df, aes(x=as.factor(cluster), y=price, fill=as.factor(cluster))) +
geom_boxplot() +
theme_minimal() +
ggtitle("Price Distribution Across Clusters") +
xlab("Cluster") +
ylab("Price")
Most clusters have prices between $500 and $5,000.
Cluster 16 has the highest prices: its top listings exceed $50,000, which may correspond to high-end luxury apartments or houses.
Prices in the other clusters are relatively stable: most clusters show little spread, indicating similar price structures among those properties.
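Given the extreme right skew, a log-scaled version of the boxplot can be easier to read; a small variation on the plot above:
# Same boxplot with a log10 price axis to compress the long upper tail
ggplot(df, aes(x=as.factor(cluster), y=price, fill=as.factor(cluster))) +
  geom_boxplot() +
  scale_y_log10() +
  theme_minimal() +
  ggtitle("Price Distribution Across Clusters (log scale)") +
  xlab("Cluster") +
  ylab("Price (log10)")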
# Visualize square feet vs. number of bedrooms
ggplot(df, aes(x=square_feet, y=bedrooms, color=as.factor(cluster))) +
geom_point(alpha=0.5) +
theme_minimal() +
ggtitle("Cluster Analysis: Bedrooms vs Square Feet")
Bedrooms are mostly concentrated between 0 and 3:
Most listings have at most 3 bedrooms and fall between roughly 1,000 and 3,000 square feet. This suggests the rental market is dominated by small apartments and houses.
Some listings are particularly large:
A few listings span 6,000-9,000 square feet with 6+ bedrooms, likely high-end homes or luxury apartments. It is worth checking which clusters they belong to and whether they are associated with high prices (see the sketch after this list).
Cluster distribution:
Listings with low bedroom counts (0-1) mostly sit in the lower price range, while larger listings (3+ bedrooms) may belong to different high-end clusters (such as 16 and 17).
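A quick way to see where those very large listings fall (a sketch, again using the merged df):
# Which clusters contain the very large listings?
df %>%
  filter(square_feet > 6000) %>%
  count(cluster, sort = TRUE)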
# Geographic visualization of clusters
ggplot(df, aes(x=longitude, y=latitude, color=as.factor(cluster))) +
geom_point(alpha=0.5) +
theme_minimal() +
ggtitle("Cluster Distribution on Geographic Data")
The differently colored clusters show some regionality:
Some clusters are distributed along the East Coast (such as New York, Washington, and Boston), others are concentrated on the West Coast (such as Los Angeles, San Francisco, and Seattle), and some are scattered across the central and southern regions.
Some areas have almost no data:
Several western states (such as Montana and Wyoming) have almost no listings, probably because the dataset is concentrated in areas with active rental markets.
# Correlation between price and square feet
correlation <- cor(df$price, df$square_feet, use="complete.obs")
print(paste("Correlation between Price and Square Feet: ", round(correlation, 3)))
## [1] "Correlation between Price and Square Feet: 0.465"
# Visualize price vs. square feet by cluster
ggplot(df, aes(x=square_feet, y=price, color=as.factor(cluster))) +
geom_point(alpha=0.5) +
theme_minimal() +
ggtitle("Price vs Square Feet by Cluster")
# Log-linear regression analysis
lm_log <- lm(log(price) ~ square_feet + latitude + longitude + bedrooms + bathrooms, data = df)
summary(lm_log)
##
## Call:
## lm(formula = log(price) ~ square_feet + latitude + longitude +
## bedrooms + bathrooms, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1038 -0.2609 -0.0421 0.2353 3.4419
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.020e+00 3.635e-02 165.598 < 2e-16 ***
## square_feet 3.233e-04 1.376e-05 23.489 < 2e-16 ***
## latitude 3.309e-03 7.346e-04 4.504 6.73e-06 ***
## longitude -5.872e-03 2.557e-04 -22.966 < 2e-16 ***
## bedrooms -1.526e-02 6.595e-03 -2.314 0.0207 *
## bathrooms 1.459e-01 1.130e-02 12.909 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3951 on 9738 degrees of freedom
## Multiple R-squared: 0.2946, Adjusted R-squared: 0.2942
## F-statistic: 813.2 on 5 and 9738 DF, p-value: < 2.2e-16
Square footage still drives price, but non-linearly. Each extra square foot is associated with a price increase of about 0.032%, so the percentage effect per square foot is small; larger homes do get more expensive, but at a diminishing rate per square foot.
Latitude is now significant. Unlike previous models where latitude had no effect, it is statistically significant here (p = 6.73e-06), suggesting higher latitudes (e.g., NYC, Chicago) have higher prices compared to southern areas.
Longitude is highly significant. Price drops by about 0.59% for each one-degree increase in longitude. Since U.S. longitudes are negative and increase moving east, the negative coefficient implies that, all else equal, listings further west (e.g., the West Coast) tend to be more expensive.
Bedrooms still show a negative impact. Each extra bedroom is associated with a 1.53% lower price, which is counterintuitive. Possible explanations: small high-end apartments (studios) are expensive, while larger homes with more bedrooms may have a lower price per square foot; and multicollinearity with square footage, since more bedrooms usually mean bigger houses, so the marginal effect of a bedroom may not be linear.
Bathrooms have the strongest impact. Each extra bathroom is associated with roughly 14.6% higher price (the coefficient is 0.146 log points; exp(0.146) − 1 ≈ 15.7% exactly), indicating that bathroom count is a key factor in luxury properties such as high-end apartments and villas.
Model fit (R² = 29.46%). The model explains 29.46% of price variation, an improvement over the linear model (~25%). Still, about 70% of the variation remains unexplained, suggesting location, property type, and market conditions play a bigger role.
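Because the response is log-transformed, the coefficients above are only approximate percentage effects; the exact effects follow from exponentiating them. A small sketch (the commented car line is optional and assumes that package is installed):
# Convert log-linear coefficients to exact percentage effects
percent_effects <- (exp(coef(lm_log)) - 1) * 100
round(percent_effects, 2)
# Optionally check multicollinearity among predictors:
# car::vif(lm_log)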
The clustering results revealed distinct rental price segments, including low-cost small apartments, mid-range rentals, and high-end luxury properties. Cluster 16 emerged as the most expensive segment, likely representing luxury apartments or high-end homes. Longitude was a highly significant predictor of price; given its negative coefficient, prices tend to rise moving westward, all else equal. Latitude became significant in the log-transformed model, indicating that higher latitudes (e.g., NYC, Chicago) tend to have higher rental prices. Bathrooms had the strongest positive impact on price, with each additional bathroom associated with roughly 15% higher rent. Bedrooms showed a counterintuitive negative impact, possibly due to high-end studio apartments being more expensive per square foot.
Our final regression model explained 29.46% of price variation,
indicating that while square footage, location, and amenities play a
role, external factors like demand, neighborhood, and property type also
heavily influence rental prices.