ABSTRACT

This study analyzes the complicated landscape of Manhattan real estate sales in 2008 using an extensive dataset from the Department of Finance (DOF). Applying k-means clustering, it concentrates on Class 1-, 2-, and 3-Family residences and offers novel findings on the geographic distribution of home sales. The dataset includes all sales with a sale price of at least $150,000 that occurred between January 1st and December 31st, 2008.

The dataset’s hidden patterns and groupings can be observed by applying k-means clustering, which clarifies the spatial heterogeneity of Manhattan’s real estate market. Through the identification of groups with similar property sales characteristics, the analysis offers a deeper understanding of the borough’s heterogeneous real estate market. The findings have consequences for those involved in the property industry, urban planning, and legislation in addition to supporting academic investigations into urban economics and housing studies.

In summary, the research offers an innovative viewpoint on Manhattan's property market in 2008, revealing geographical nuances that go beyond standard assessments. Through the application of k-means clustering, this study contributes to the expanding corpus of research on the complex nature of urban property markets; the clustering techniques were used to detect similarities and differences between properties. It also provides useful insights for those involved in negotiating the ever-changing real estate landscape in one of the most dynamic cities on the planet.

INFORMATION ABOUT THE DATA

This is a Non-Federal dataset covered by different terms of use than Data.gov.

Data source: https://catalog.data.gov/dataset/dof-summary-of-neighborhood-sales-in-manhattan-for-class-1-2-and-3-family-homes-2008

DOF: Summary of Neighborhood Sales in Manhattan for Class 1-, 2- and 3-Family homes - 2008

Publisher: data.cityofnewyork.us

Access & Use Information: Public. This dataset is intended for public access and use. License: No license information was provided.

Resource Type: Dataset. Metadata Created Date: November 10, 2020. Metadata Updated Date: September 2, 2023.

Maintainer: NYC OpenData

The Department of Finance (DOF) maintains records for all property sales in New York City, including sales of family homes in each borough. This list is a summary of neighborhood sales for Class 1-, 2- and 3-Family homes in Manhattan in 2008. It includes all sales of 1-, 2-, and 3-Family homes from January 1, 2008 to December 31, 2008 with a sale price equal to or greater than $150,000. The Building Class Category for Sales is based on the Building Class at the time of the sale. Update Frequency: Annually.

TABLE OF CONTENTS

  1. ABSTRACT
  2. INFORMATION ABOUT THE DATA
  3. INTRODUCTION
  4. RESEARCH QUESTION
  5. DATA PREPARATION
  6. STATISTICAL ANALYSIS
  7. K-MEANS CLUSTERING
  8. VISUALIZATION OF CLUSTERING RESULTS
  9. FINDINGS
  10. CONCLUSION
  11. REFERENCES

INTRODUCTION

Since the 1990s, Manhattan’s housing prices have skyrocketed. While the demand side of this growth can be explained by factors such as declining interest rates and rising wages, some sluggishness in the supply of apartment complexes is required to explain the high and rising costs. The cost of adding a floor to any new building is the marginal cost of offering more housing in a market where high-rises predominate. Although there are virtually no natural obstacles to entry in the highly competitive home building sector, prices in Manhattan seem to be more than twice their supply costs at the moment.[1]

New York City's urban topography is a representation of the city's cultural richness as well as an active real estate market that constantly reshapes the neighborhoods inside the city's boundaries. The complex structure of property market transactions in this rapidly changing city provides important new perspectives on housing patterns, economic fluctuations, and social dynamics. This study explores the vast amount of data contained in the Department of Finance's extensive dataset, with a particular emphasis on real estate sales in Manhattan in 2008.

The Property Category classification, which offers a snapshot of the cityscape at the time of each sale, is at the core of this dataset. The House Type Categories for Sales make possible a comprehensive investigation of the various ways in which the structure of Manhattan's housing market influenced real estate sales during this period. In addition to reflecting the exterior features of the properties, the building class provides insight into how communities change over time, because different classes may represent different architectural trends, land uses, or historical significance. The dataset's reliability is further improved by its annual update frequency, which ensures that it provides a complete record of real estate sales for the specified year. Because of this temporal regularity, researchers can monitor trends, spot anomalies, and derive important insights into Manhattan's dynamic real estate market.[2]

By navigating this extensive dataset, this study aims to illuminate the complex nature of real estate sales in one of the most dynamic urban environments on the planet.

RESEARCH QUESTION

How can Manhattan neighborhoods be effectively segmented based on property sales characteristics, specifically examining the average sale price and the number of sales for Class 1-, 2-, and 3-Family homes in 2008?

The objective of this research question is to identify discrete groupings or clusters of neighborhoods that have comparable pricing and sales volume profiles. It provides the opportunity to investigate whether specific Manhattan neighborhoods show comparable property sales dynamics throughout the given time frame, offering insight into the geographic trends in real estate transactions. This is particularly helpful for understanding how sales characteristics differ across regions.

DATA PREPARATION

# 1. Changing the language to English
Sys.setlocale("LC_ALL","English")
## Warning in Sys.setlocale("LC_ALL", "English"): using locale code page other
## than 65001 ("UTF-8") may cause problems
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Sys.setenv(LANGUAGE='en')

Installing the Packages

# Set the CRAN mirror
options(repos = c(CRAN = "https://cloud.r-project.org"))

install.packages("readr")
## package 'readr' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\User\AppData\Local\Temp\RtmpGkjBkT\downloaded_packages
install.packages("stats")
## Warning: package 'stats' is in use and will not be installed
install.packages("factoextra")
## package 'factoextra' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\User\AppData\Local\Temp\RtmpGkjBkT\downloaded_packages
install.packages("flexclust")
## package 'flexclust' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\User\AppData\Local\Temp\RtmpGkjBkT\downloaded_packages
install.packages("fpc")
## package 'fpc' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\User\AppData\Local\Temp\RtmpGkjBkT\downloaded_packages
install.packages("clustertend")
## package 'clustertend' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\User\AppData\Local\Temp\RtmpGkjBkT\downloaded_packages
install.packages("cluster")
## package 'cluster' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\User\AppData\Local\Temp\RtmpGkjBkT\downloaded_packages
install.packages("ClusterR")
## package 'ClusterR' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\User\AppData\Local\Temp\RtmpGkjBkT\downloaded_packages
install.packages("dplyr")
## package 'dplyr' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\User\AppData\Local\Temp\RtmpGkjBkT\downloaded_packages
install.packages("ggplot2")
## package 'ggplot2' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\User\AppData\Local\Temp\RtmpGkjBkT\downloaded_packages
install.packages("hopkins")
## package 'hopkins' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\User\AppData\Local\Temp\RtmpGkjBkT\downloaded_packages
install.packages("NbClust")
## package 'NbClust' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\User\AppData\Local\Temp\RtmpGkjBkT\downloaded_packages
install.packages("tidyverse")
## package 'tidyverse' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\User\AppData\Local\Temp\RtmpGkjBkT\downloaded_packages
install.packages("dendextend")
## package 'dendextend' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\User\AppData\Local\Temp\RtmpGkjBkT\downloaded_packages

Activating the packages with the library() function

library(readr)
library(stats)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(flexclust)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
library(grid)
library(lattice)
library(modeltools)
library(stats4)
library(hopkins)
library(fpc)
library(clustertend)
## Package `clustertend` is deprecated.  Use package `hopkins` instead.
## 
## Attaching package: 'clustertend'
## The following object is masked from 'package:hopkins':
## 
##     hopkins
library(cluster)
library(ClusterR)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(NbClust)
library(tidyverse)
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v forcats 1.0.0     v tibble  3.2.1
## v purrr   1.0.2     v tidyr   1.3.0
## v stringr 1.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dendextend)
## 
## ---------------------
## Welcome to dendextend version 1.17.1
## Type citation('dendextend') for how to cite the package.
## 
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
## 
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags: 
##   https://stackoverflow.com/questions/tagged/dendextend
## 
##  To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
## ---------------------
## 
## 
## Attaching package: 'dendextend'
## 
## The following object is masked from 'package:stats':
## 
##     cutree

Assigning the dataset to the data object

data <- read.csv("C:/Users/User/Desktop/UL Research 1/Data.csv")

A brief overview of the dataset

head(data)
##                NEIGHBORHOOD                                TYPE.OF.HOME
## 1 ALPHABET CITY             03 THREE FAMILY HOMES                      
## 2 CHELSEA                   01 ONE FAMILY HOMES                        
## 3 CHELSEA                   02 TWO FAMILY HOMES                        
## 4 CHELSEA                   03 THREE FAMILY HOMES                      
## 5 CHINATOWN                 02 TWO FAMILY HOMES                        
## 6 EAST VILLAGE              01 ONE FAMILY HOMES                        
##   NUMBER.OF.SALES LOWEST.SALE.PRICE AVERAGE.SALE.PRICE MEDIAN.SALE.PRICE
## 1               2           3200000            3275000           3275000
## 2               1           3800000            3800000           3800000
## 3               1           4388888            4388888           4388888
## 4               2           4280000            4640000           4640000
## 5               1           9425000            9425000           9425000
## 6               1           5300000            5300000           5300000
##   HIGHEST.SALE.PRICE
## 1            3350000
## 2            3800000
## 3            4388888
## 4            5000000
## 5            9425000
## 6            5300000
View(data)

The structure of the dataset

str(data)
## 'data.frame':    57 obs. of  7 variables:
##  $ NEIGHBORHOOD      : chr  "ALPHABET CITY            " "CHELSEA                  " "CHELSEA                  " "CHELSEA                  " ...
##  $ TYPE.OF.HOME      : chr  "03 THREE FAMILY HOMES                      " "01 ONE FAMILY HOMES                        " "02 TWO FAMILY HOMES                        " "03 THREE FAMILY HOMES                      " ...
##  $ NUMBER.OF.SALES   : int  2 1 1 2 1 1 1 1 3 1 ...
##  $ LOWEST.SALE.PRICE : num  3200000 3800000 4388888 4280000 9425000 ...
##  $ AVERAGE.SALE.PRICE: num  3275000 3800000 4388888 4640000 9425000 ...
##  $ MEDIAN.SALE.PRICE : num  3275000 3800000 4388888 4640000 9425000 ...
##  $ HIGHEST.SALE.PRICE: num  3350000 3800000 4388888 5000000 9425000 ...
nrow(data)
## [1] 57
ncol(data)
## [1] 7

This data frame consists of 7 columns (variables) and 57 rows (observations). The file contains information (NEIGHBORHOOD, TYPE OF HOME, NUMBER OF SALES, LOWEST SALE PRICE, AVERAGE SALE PRICE, MEDIAN SALE PRICE, HIGHEST SALE PRICE) about properties in the Manhattan region.

As we focus on the dataset’s primary variables, we find a multitude of data captured in columns like “neighborhood,” “home type,” “number of sales,” “lowest sale price,” “average sale price,” “median sale price,” and “highest sale price.” All these factors together provide the foundation for a complex study. Understanding how various neighborhoods within Manhattan contributed to the total property sales landscape is made possible by the “neighborhood” variable, which makes it easier to explore geographic variances in detail.

The “type of home” variable provides an informative viewpoint on the variety of housing possibilities in Manhattan by classifying properties into 1-, 2-, and 3-family houses. This distinction is essential to understanding the needs and desires of urban residents because it clarifies whether larger, multi-family apartments or smaller, single-family homes were more common in the real estate deals that were examined.

A whole story of the financial characteristics of property sales is formed by the quantitative measurements “number of sales,” “lowest sale price,” “average sale price,” “median sale price,” and “highest sale price” taken together. These measures offer a comprehensive picture of the economic range of sales, from small-scale transactions to those involving luxury properties, in addition to quantifying the volume of transactions.

STATISTICAL ANALYSIS

  1. To begin with, explore the overall statistical summary of the data:
summary(data)
##  NEIGHBORHOOD       TYPE.OF.HOME       NUMBER.OF.SALES LOWEST.SALE.PRICE 
##  Length:57          Length:57          Min.   : 1      Min.   :  175000  
##  Class :character   Class :character   1st Qu.: 1      1st Qu.:  999000  
##  Mode  :character   Mode  :character   Median : 2      Median : 3450000  
##                                        Mean   : 3      Mean   : 3477236  
##                                        3rd Qu.: 3      3rd Qu.: 4400000  
##                                        Max.   :21      Max.   :13500000  
##  AVERAGE.SALE.PRICE MEDIAN.SALE.PRICE  HIGHEST.SALE.PRICE
##  Min.   :  200000   Min.   :  200000   Min.   :  200000  
##  1st Qu.: 1201754   1st Qu.: 1100000   1st Qu.: 2200000  
##  Median : 4388888   Median : 4388888   Median : 4388888  
##  Mean   : 4782088   Mean   : 4567053   Mean   : 6884256  
##  3rd Qu.: 6291667   3rd Qu.: 5770000   3rd Qu.: 7950000  
##  Max.   :15655000   Max.   :13825000   Max.   :49000000

This statistical summary helps us understand the basic statistical structure of the dataset. The minimum number of sales is 1 and the maximum is 21; although the range is wide, the mean is 3. The minimum sale price is $175,000 for the lowest-price column and $200,000 for the average, median, and highest-price columns. The maximum sale prices are $13,500,000 (lowest), $15,655,000 (average), $13,825,000 (median), and $49,000,000 (highest).

  2. Correlation Analysis:

Assess the correlation between variables, especially between “average sale price” and “number of sales.” This can give insights into how these variables might relate to each other.

cor(data$AVERAGE.SALE.PRICE, data$NUMBER.OF.SALES)
## [1] 0.3136815

In Manhattan neighborhoods, there is a moderate positive relationship between the average sale price and the number of sales, as indicated by the correlation coefficient of 0.31: the number of property sales tends to increase in tandem with average prices. A comprehensive understanding of this connection requires additional investigation of the contributing factors.[3]
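
As a quick, hedged check on this relationship, a Pearson correlation test could be run on the same two columns; cor.test() reports the estimate together with a confidence interval and a p-value. This is a sketch of the procedure rather than part of the original analysis.

# Hedged sketch: test whether the observed correlation differs from zero
cor.test(data$AVERAGE.SALE.PRICE, data$NUMBER.OF.SALES, method = "pearson")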

  3. Distribution Plots:
print(ggplot(data, aes(x = NUMBER.OF.SALES, y = TYPE.OF.HOME)) +
  geom_line(color = "black", alpha = 0.7) +
  labs(title = "The number of sales in different home types"))

The line graph relates the number of sales to home type. One-family homes show the highest sales counts (more than 20 in one neighborhood), followed by two-family homes, while three-family homes show the fewest (fewer than 10).
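
To back up this reading with numbers, the sales counts can be aggregated per home type with dplyr (loaded earlier); this is a brief illustrative sketch rather than part of the original workflow.

# Hedged sketch: total and mean number of sales per home type
data %>%
  group_by(TYPE.OF.HOME) %>%
  summarise(total_sales = sum(NUMBER.OF.SALES),
            mean_sales = mean(NUMBER.OF.SALES)) %>%
  arrange(desc(total_sales))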

  4. Box Plots:

Central tendency of “average sale price” and “number of sales.”

ggplot(data, aes(x = 1, y = AVERAGE.SALE.PRICE)) +
  geom_boxplot(fill = "red", color = "black") +
  labs(title = "Box Plot of Average Sale Price")

ggplot(data, aes(x = 1, y = NUMBER.OF.SALES)) +
  geom_boxplot(fill = "green", color = "black") +
  labs(title = "Box Plot of Number of Sales")

The box plots effectively illustrate the distribution, central tendency, and variability in “average sale price” and “number of sales.” These visualizations provide valuable insights into the housing market dynamics, aiding in a comprehensive understanding of the dataset’s key features.
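
The numeric counterparts of these box plots can be obtained directly from base R; the following brief sketch (added for reference) prints the five-number summary and interquartile range of both variables.

quantile(data$AVERAGE.SALE.PRICE)   # five-number summary of average sale price
IQR(data$AVERAGE.SALE.PRICE)        # interquartile range of average sale price
quantile(data$NUMBER.OF.SALES)      # five-number summary of number of sales
IQR(data$NUMBER.OF.SALES)           # interquartile range of number of sales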

Checking for NA values in the dataset, showing the number of missing values per column, and checking for duplicated rows.

sum(is.na(data))
## [1] 0
colSums(is.na(data))
##       NEIGHBORHOOD       TYPE.OF.HOME    NUMBER.OF.SALES  LOWEST.SALE.PRICE 
##                  0                  0                  0                  0 
## AVERAGE.SALE.PRICE  MEDIAN.SALE.PRICE HIGHEST.SALE.PRICE 
##                  0                  0                  0
data[duplicated(data), ]
## [1] NEIGHBORHOOD       TYPE.OF.HOME       NUMBER.OF.SALES    LOWEST.SALE.PRICE 
## [5] AVERAGE.SALE.PRICE MEDIAN.SALE.PRICE  HIGHEST.SALE.PRICE
## <0 rows> (or 0-length row.names)

We can see that there are no missing values in the dataset, so no data cleaning or missing data handling is needed here. If it were needed, we could delete rows that contain NA values with the na.omit() function or substitute NA values with the mean or median, as sketched below.
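
The sketch below illustrates those two strategies for completeness; it is not needed for this dataset, and the object names data_clean and data_imputed are illustrative only.

# Option 1: drop rows that contain any NA values
data_clean <- na.omit(data)

# Option 2: replace NAs in a numeric column with the column mean
data_imputed <- data
data_imputed$AVERAGE.SALE.PRICE[is.na(data_imputed$AVERAGE.SALE.PRICE)] <-
  mean(data_imputed$AVERAGE.SALE.PRICE, na.rm = TRUE)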

The average prices in different home types

aggregate(data = data, data$AVERAGE.SALE.PRICE ~ data$TYPE.OF.HOME, mean, na.rm = TRUE)
##                             data$TYPE.OF.HOME data$AVERAGE.SALE.PRICE
## 1 01 ONE FAMILY HOMES                                         5578120
## 2 02 TWO FAMILY HOMES                                         4353506
## 3 03 THREE FAMILY HOMES                                       4502637

The aggregation reveals the mean average sale prices for Manhattan's family home types: one-family homes have the highest mean price (5,578,120 USD), followed by three-family homes (4,502,637 USD) and two-family homes (4,353,506 USD). This offers a basic overview of the pricing differences between the home types in the dataset.

The average house prices within the neighborhoods

aggregate(data = data, data$AVERAGE.SALE.PRICE ~ data$NEIGHBORHOOD, mean, na.rm = TRUE)
##            data$NEIGHBORHOOD data$AVERAGE.SALE.PRICE
## 1  ALPHABET CITY                           3275000.0
## 2  CHELSEA                                 4276296.0
## 3  CHINATOWN                               9425000.0
## 4  EAST VILLAGE                            4300000.0
## 5  FASHION                                 4125000.0
## 6  GRAMERCY                                7630000.0
## 7  GREENWICH VILLAGE-CENTRAL               6035095.0
## 8  GREENWICH VILLAGE-WEST                  8462777.7
## 9  HARLEM-CENTRAL                          1155572.7
## 10 HARLEM-EAST                             1143319.0
## 11 HARLEM-UPPER                            1467074.7
## 12 HARLEM-WEST                              604761.0
## 13 INWOOD                                   428500.0
## 14 KIPS BAY                                2250000.0
## 15 LITTLE ITALY                            3950000.0
## 16 LOWER EAST SIDE                         3280000.0
## 17 MANHATTAN VALLEY                        2750000.0
## 18 MIDTOWN EAST                           13210000.0
## 19 MURRAY HILL                             2586733.3
## 20 SOHO                                    4500000.0
## 21 TRIBECA                                 5770000.0
## 22 UPPER EAST SIDE (59-79)                11964519.0
## 23 UPPER EAST SIDE (79-96)                 6326115.0
## 24 UPPER WEST SIDE (59-79)                 7148033.3
## 25 UPPER WEST SIDE (79-96)                 5438194.7
## 26 UPPER WEST SIDE (96-116)                 499000.0
## 27 WASHINGTON HEIGHTS LOWER                 614618.3
## 28 WASHINGTON HEIGHTS UPPER                 200000.0

Breaking down Manhattan's average prices by neighborhood reveals diverse real estate patterns. Each neighborhood has a distinct average sale price, offering insight into the complex processes of real estate pricing. Stakeholders can rely on this information to guide strategic decisions and actions in the ever-changing Manhattan real estate market.

K-MEANS CLUSTERING

k-means, one of the earliest clustering methods, was introduced by J. B. MacQueen in 1967 (MacQueen, 1967).

In the world of data mining, k-means is one of the most popular clustering algorithms. The letter "k" in the name indicates the number of clusters. Clustering algorithms automatically divide data into smaller groups or subsets, grouping together records that are statistically similar.[4] Each element belongs to exactly one cluster. The value at the center of a cluster is its representative value; when this center is required to be an actual data point, it is called a medoid. Values within a cluster should be as similar to each other as possible, while distinct clusters should be as dissimilar as possible.[5]

Applying k-means clustering to the dataset systematically groups comparable data points together, enabling a deeper comprehension of underlying trends. The k-means algorithm, a partitioning technique, requires the number of clusters (k) to be specified before it divides the data into discrete groups.

By combining these transactions into clusters iteratively according to how similar their attributes are, the k-means method optimizes the clustering to reduce intra-cluster variability.[6]
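
Formally (a standard formulation added here for reference), with clusters $C_1, \dots, C_k$ and centroids $\mu_1, \dots, \mu_k$, k-means seeks the partition that minimizes the within-cluster sum of squares:

$$\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \left\lVert x - \mu_i \right\rVert^{2}$$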

z-score standardization

# Selecting the relevant feature
features <- data$AVERAGE.SALE.PRICE

# Standardize the data
standardized_features <- scale(features)

#Reshape the standardized data if needed (required for k-means)
features_matrix <- matrix(standardized_features, ncol = 1)

It is vital to standardize variables prior to clustering in order to guarantee fair and meaningful comparisons among different attributes. Without standardization, larger-scale variables may dominate the clustering process and produce biased findings. By transforming variables to have a mean of zero and a standard deviation of one, standardization allows every attribute to be treated equally. This preprocessing step improves the accuracy and efficiency of clustering algorithms and encourages more representative and accurate cluster assignments.[7]
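
For reference, scale() applies the z-score transformation z = (x - mean(x)) / sd(x). A quick hedged sanity check (added here, not part of the original script) that the standardized feature has mean approximately zero and standard deviation one:

# Sanity check on the standardized feature (should be ~0 and 1)
mean(standardized_features)
sd(as.numeric(standardized_features))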

# Set the number of clusters (adjust as needed)
num_clusters <- 3

# K-means clustering
kmeans_result <- kmeans(features_matrix, centers = num_clusters)
#Add cluster labels to the original dataset
data$CLUSTER <- kmeans_result$cluster

#View the results
print(data[, c("NEIGHBORHOOD", "AVERAGE.SALE.PRICE", "CLUSTER")])
##                 NEIGHBORHOOD AVERAGE.SALE.PRICE CLUSTER
## 1  ALPHABET CITY                        3275000       3
## 2  CHELSEA                              3800000       3
## 3  CHELSEA                              4388888       3
## 4  CHELSEA                              4640000       3
## 5  CHINATOWN                            9425000       2
## 6  EAST VILLAGE                         5300000       3
## 7  EAST VILLAGE                         3300000       3
## 8  FASHION                              4125000       3
## 9  GRAMERCY                             7630000       3
## 10 GREENWICH VILLAGE-CENTRAL            5529097       3
## 11 GREENWICH VILLAGE-CENTRAL            5687500       3
## 12 GREENWICH VILLAGE-CENTRAL            6888688       3
## 13 GREENWICH VILLAGE-WEST              10583333       2
## 14 GREENWICH VILLAGE-WEST               8665000       2
## 15 GREENWICH VILLAGE-WEST               6140000       3
## 16 HARLEM-CENTRAL                       1459250       1
## 17 HARLEM-CENTRAL                       1149048       1
## 18 HARLEM-CENTRAL                        858420       1
## 19 HARLEM-EAST                          1250000       1
## 20 HARLEM-EAST                          1036638       1
## 21 HARLEM-UPPER                         2325000       1
## 22 HARLEM-UPPER                         1201754       1
## 23 HARLEM-UPPER                          874470       1
## 24 HARLEM-WEST                           604761       1
## 25 INWOOD                                428500       1
## 26 KIPS BAY                             1000000       1
## 27 KIPS BAY                             3500000       3
## 28 LITTLE ITALY                         3950000       3
## 29 LOWER EAST SIDE                      3280000       3
## 30 MANHATTAN VALLEY                     2750000       1
## 31 MIDTOWN EAST                        15655000       2
## 32 MIDTOWN EAST                        10625000       2
## 33 MIDTOWN EAST                        13350000       2
## 34 MURRAY HILL                          2306200       1
## 35 MURRAY HILL                          4455000       3
## 36 MURRAY HILL                           999000       1
## 37 SOHO                                 3550000       3
## 38 SOHO                                 5450000       3
## 39 TRIBECA                              5770000       3
## 40 UPPER EAST SIDE (59-79)             13968557       2
## 41 UPPER EAST SIDE (59-79)              8425000       2
## 42 UPPER EAST SIDE (59-79)             13500000       2
## 43 UPPER EAST SIDE (79-96)              9103345       2
## 44 UPPER EAST SIDE (79-96)              4725000       3
## 45 UPPER EAST SIDE (79-96)              5150000       3
## 46 UPPER WEST SIDE (59-79)              7733300       3
## 47 UPPER WEST SIDE (59-79)              8210800       2
## 48 UPPER WEST SIDE (59-79)              5500000       3
## 49 UPPER WEST SIDE (79-96)              6291667       3
## 50 UPPER WEST SIDE (79-96)              4576667       3
## 51 UPPER WEST SIDE (79-96)              5446250       3
## 52 UPPER WEST SIDE (96-116)              499000       1
## 53 WASHINGTON HEIGHTS LOWER              572917       1
## 54 WASHINGTON HEIGHTS LOWER              587213       1
## 55 WASHINGTON HEIGHTS LOWER              683725       1
## 56 WASHINGTON HEIGHTS UPPER              200000       1
## 57 WASHINGTON HEIGHTS UPPER              200000       1
# Print the cluster centers

print(kmeans_result$centers)
##          [,1]
## 1 -0.96960114
## 2  1.62718398
## 3  0.05742304
# Print the cluster size
print(kmeans_result$size)
## [1] 20 11 26
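
Because kmeans() starts from random centroids, the exact cluster labels (and occasionally the assignments) can differ between runs. A hedged sketch of a more reproducible and robust call is shown below; the seed value 123 and the object name kmeans_result_repro are illustrative, and nstart = 25 keeps the best of 25 random initializations.

set.seed(123)  # arbitrary seed, for reproducibility only
kmeans_result_repro <- kmeans(features_matrix, centers = num_clusters, nstart = 25)
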
library(cluster)

# Assuming 'data' is your dataset and 'cluster' is the column with cluster labels
silhouette_score <- silhouette(data$CLUSTER, dist(data[, "AVERAGE.SALE.PRICE"]))

# Print the silhouette score
print(silhouette_score)
##       cluster neighbor  sil_width
##  [1,]       3        1 0.19248862
##  [2,]       3        1 0.48417508
##  [3,]       3        1 0.65704593
##  [4,]       3        1 0.69787609
##  [5,]       2        3 0.44756655
##  [6,]       3        1 0.74356305
##  [7,]       3        1 0.21141059
##  [8,]       3        1 0.59330228
##  [9,]       3        2 0.19794655
## [10,]       3        1 0.74508324
## [11,]       3        1 0.73739624
## [12,]       3        2 0.49784926
## [13,]       2        3 0.58299867
## [14,]       2        3 0.24670643
## [15,]       3        2 0.69407458
## [16,]       1        3 0.78902250
## [17,]       1        3 0.85263986
## [18,]       1        3 0.87301605
## [19,]       1        3 0.83599762
## [20,]       1        3 0.86576539
## [21,]       1        3 0.48188813
## [22,]       1        3 0.84475888
## [23,]       1        3 0.87293161
## [24,]       1        3 0.86630832
## [25,]       1        3 0.84663908
## [26,]       1        3 0.86892074
## [27,]       3        1 0.34105404
## [28,]       3        1 0.53980767
## [29,]       3        1 0.19645040
## [30,]       1        3 0.20546293
## [31,]       2        3 0.52408273
## [32,]       2        3 0.58534817
## [33,]       2        3 0.62277863
## [34,]       1        3 0.49136997
## [35,]       3        1 0.66991528
## [36,]       1        3 0.86897976
## [37,]       3        1 0.36862519
## [38,]       3        1 0.74678197
## [39,]       3        1 0.73219888
## [40,]       2        3 0.60385511
## [41,]       2        3 0.14477588
## [42,]       2        3 0.62061115
## [43,]       2        3 0.38069415
## [44,]       3        1 0.70671264
## [45,]       3        1 0.73710915
## [46,]       3        2 0.14300798
## [47,]       2        3 0.02756555
## [48,]       3        1 0.74603173
## [49,]       3        2 0.66134990
## [50,]       3        1 0.68957879
## [51,]       3        1 0.74677070
## [52,]       1        3 0.85577185
## [53,]       1        3 0.86390313
## [54,]       1        3 0.86516641
## [55,]       1        3 0.86963727
## [56,]       1        3 0.81387352
## [57,]       1        3 0.81387352
## attr(,"Ordered")
## [1] FALSE
## attr(,"call")
## silhouette.default(x = data$CLUSTER, dist = dist(data[, "AVERAGE.SALE.PRICE"]))
## attr(,"class")
## [1] "silhouette"
# If you want to get the average silhouette score
average_silhouette_score <- mean(silhouette_score[, "sil_width"])
print(average_silhouette_score)
## [1] 0.6124652

A higher silhouette score (closer to 1) indicates better-defined clusters, while a lower score (closer to -1) suggests overlapping or poorly separated clusters.
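
For reference, the silhouette width of observation i is computed (standard definition, added here for clarity) from a(i), its mean distance to the other points in its own cluster, and b(i), its mean distance to the points in the nearest other cluster:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$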

library(factoextra)
optimal_clusters <- fviz_nbclust(features_matrix, kmeans, method = "silhouette")
print(optimal_clusters)

library(factoextra)
optimal_clusters <- fviz_nbclust(features_matrix, kmeans, method = "wss")
print(optimal_clusters)

A crucial step in the k-means clustering study is determining the ideal number of clusters. Two popular techniques, the Elbow method and the Silhouette method, were used to guide this decision.

Insights from the Silhouette method: The Silhouette method indicated that four clusters would be ideal, implying that each data point would then be most similar to its own cluster relative to the other clusters. However, a compromise was reached, taking into account the pragmatic goal of producing a meaningful segmentation while keeping the number of clusters reasonable.[8]

Insights from the Elbow method: This technique plots the within-cluster sum of squares (WCSS) against the number of clusters (k). As k increases and the clusters become smaller, the WCSS, a measure of cluster compactness, drops. The ideal number of clusters is usually found at the "elbow" of the curve, the point where the WCSS begins to decline more slowly. The plot shows a clear elbow at k = 3, the point beyond which adding clusters does not considerably increase the explanatory power of the model. This is consistent with the parsimony principle: when additional clusters add little explanatory value, a simpler model with fewer clusters is preferred.[9]

Final Decision: Taking into account the parsimony principle and the insights from both methods, we took the more conservative option and chose three clusters. This selection serves the main objective of developing an understandable and practical clustering solution, balancing the identification of significant patterns in the data against unnecessary complexity.
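
As an optional cross-check of this decision, the gap statistic (available via clusGap() in the already-loaded cluster package) compares the observed within-cluster dispersion with that expected under a reference null distribution. A hedged sketch follows; the seed, B = 50 bootstrap samples, and K.max = 8 are illustrative choices, and its verdict may or may not agree with the choice of three clusters.

set.seed(123)  # arbitrary seed; the gap statistic uses bootstrap resampling
gap_stat <- clusGap(features_matrix, FUN = kmeans, K.max = 8, B = 50)
print(gap_stat, method = "firstmax")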

VISUALIZATION OF CLUSTERING RESULTS

  1. Simple Plots of Clustering:
# Simple plot of the one-dimensional clustering result (points colored by cluster)
plot(features_matrix, col = data$CLUSTER, main = "K-means Clustering",
     pch=".", cex=7)

# Visualize the clustering results if needed
plot(features_matrix, col = data$CLUSTER, main = "K-means Clustering")

  2. Scatter Plot of Clustering:
library(ggplot2)
ggplot(data, aes(x = features_matrix, y = TYPE.OF.HOME, color = factor(CLUSTER))) +
  geom_point() +
  labs(title = "Standardized Average Sale Price by Home Type and Cluster",
       color = "Cluster")

ggplot(data, aes(x = features_matrix, y = NUMBER.OF.SALES, color = factor(CLUSTER))) +
  geom_point() +
  labs(title = "Standardized Average Sale Price and Number of Sales by Cluster",
       color = "Cluster")

library(ggplot2)

ggplot(data, aes(x = features_matrix, y = 1, color = factor(CLUSTER))) +
  geom_point() +
  labs(title = "K-means Clustering of Prices",
       x = "Price",
       y = "") +
  theme_minimal()

The clusters are visualized with a scatter plot in which the x-axis shows the standardized average sale price, the y-axis is set to a constant value, and each point is colored by its assigned cluster. This helps show how the clusters are distributed along the average sale price axis.

  3. Cluster Plot (clusplot) by Cluster:
library(cluster)
# clusplot() requires numeric input, so restrict the data to the numeric variables
clusplot(data[, c("NUMBER.OF.SALES", "AVERAGE.SALE.PRICE")], kmeans_result$cluster,
         color = TRUE, shade = TRUE, labels = 2, lines = 0)

  4. Box Plot by Cluster:
ggplot(data, aes(x = factor(CLUSTER), y = features_matrix, fill = factor(CLUSTER))) +
  geom_boxplot() +
  labs(title = "Box Plot of Average Sale Prices by Cluster",
       x = "Cluster",
       y = "Average Sale Price (standardized)",
       fill = "Cluster") +
  theme_minimal()

The box plot visualizes the distribution of the (standardized) average sale price within each cluster, giving insight into the spread and central tendency of prices in each cluster.

FINDINGS

  1. Visualization of Data and Investigation: The preliminary data visualization and analysis highlighted clear trends in the distribution of average sale prices among neighborhoods. The box plot illustrated price variation, showing higher or lower average sale prices in particular neighborhoods. This variation, in particular, paves the way for further research into the factors that might affect real estate values in different locations.

  2. Quantitative Analysis: Significant information was obtained from the correlation analysis between the number of sales and average sale prices. The correlation coefficient of about 0.31 indicates a moderate positive association: neighborhoods with higher average sale prices tend to record more real estate transactions. However, since correlation does not imply causation, caution is essential when interpreting this relationship.

  3. Clustering Analysis: Based on average sale prices, neighborhoods were divided into separate groups using k-means clustering. The number of clusters was established by applying the elbow and silhouette approaches. The resulting clusters offer a methodical arrangement of neighborhoods, illuminating similarities and differences in property values.

CONCLUSION

The comprehension of how the neighborhoods in the real estate dataset are segmented is improved by the incorporation of clustering results.

Cluster Characteristics: Each cluster denotes a distinct neighborhood segment with comparable average sale prices. The clustering technique uncovered latent patterns and groups that might not have been readily apparent through standard data exploration.

Market Segmentation: The real estate market can be effectively divided into the recognized groupings. With this knowledge, stakeholders can modify their approaches and actions according to the features of particular neighborhood subgroups.

Correlation Validation: Clustering gives additional context to the positive correlation between the number of sales and average sale prices. It implies that neighborhoods in the same cluster share comparable transaction volumes in addition to similar pricing trends.
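
A per-cluster summary such as the hedged sketch below (using dplyr, loaded earlier) can make these cluster characteristics concrete by reporting each cluster's typical price level and total transaction volume; it is illustrative rather than part of the original analysis.

# Hedged sketch: summarize each cluster's price level and transaction volume
data %>%
  group_by(CLUSTER) %>%
  summarise(rows = n(),
            mean_avg_sale_price = mean(AVERAGE.SALE.PRICE),
            total_sales = sum(NUMBER.OF.SALES))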

Going forward, a more detailed investigation of the unique characteristics and dynamics of each cluster is advised. This might entail:

Cluster Profiling: Examine each cluster in-depth to determine the key features or factors that influence the average sale prices that have been observed.

Periodic Trends: Examine how clusters change over time, taking into account any shifts in market dynamics or temporal trends.

External Factors: Investigate external factors that could affect real estate values within specific clusters, such as development initiatives or economic indicators.

The comprehensive report provides an extensive viewpoint on neighborhood segmentation, creating the foundation for intelligent real estate decision-making. Through an in-depth study of the distinct attributes of each cluster and a consideration of evolving circumstances, stakeholders can effectively navigate the complex geography of the real estate industry. This all-encompassing strategy guarantees a sophisticated understanding, allowing decision-makers to adjust to shifting trends and optimize opportunities throughout every segmented neighborhood.

REFERENCES

  1. Edward L. Glaeser, Joseph Gyourko, and Raven Saks, Why Is Manhattan So Expensive? Regulation and the Rise in Housing Prices, The University of Chicago Press Journals

  2. Usman, H., Lizam, M., Adekunle, M.U. (2020). Property price modelling, market segmentation and submarket classifications: a review. Real Estate Management and Valuation, 28(3), 24-35.

  3. Walter S. Monroe, Dewey B. Stuit, The Interpretation of the Coefficient of Correlation, The Journal of Experimental Education, Vol. 1, No. 3, Experimental Techniques (Mar., 1933), pp. 186-203 (18 pages)

  4. Shi Na, Liu Xumin, Guan Yong, Research on k-means Clustering Algorithm: An Improved k-means Clustering Algorithm, 2010 Third International Symposium on Intelligent Information Technology and Security Informatics, 22 April 2010

  5. KAUFMAN L., and ROUSSEEUW P., Finding Groups in Data: An Introduction to Cluster Analysis, New York: J. Wiley & Son, 1990

  6. Shi, Donghui; Guan, Jian; Zurada, Jozef; and Levitan, Alan S. (2015) “An Innovative Clustering Approach to Market Segmentation for Improved Price Prediction,” Journal of International Technology and Information Management: Vol. 24: Iss. 1, Article 2.

  7. Peshawa J. Muhammad Ali and Rezhna H. Faraj, Data Normalization and Standardization: A Technical Report, Machine Learning Technical Reports, 2014, 1(1), pp. 1-6.

  8. Trupti M. Kodinariya and Prashant R. Makwana, Review on determining number of Cluster in K-Means Clustering, International Journal of Advance Research in Computer Science and Management Studies, ISSN: 2321-7782 (Online), Volume 1, Issue 6, November 2013

  9. Congming Shi, Bingtao Wei, Shoulin Wei, Wen Wang, Hai Liu & Jialei Liu, A quantitative discriminant method of elbow point for the optimal number of clusters in clustering algorithm, EURASIP Journal on Wireless Communications and Networking, 15 February 2021