Notebook 02
The goal of this tutorial is to deconstruct the CompStak dataset through a series of queries. We start with a completeness analysis of key variables, aimed at finding which tenants have the most flexible leases as well as differences in transaction sizes. We then break down the lease variables and examine them by industry and location. The second part of the analysis looks at the kurtosis of the data and examines the distribution in detail with a series of scatter plots. The final part analyzes the data using simple clustering techniques such as K-means and dendrograms.
### Part one: Rent analysis - concessions, benefits and rent bumps
In this case study we look at the difference between Current Rent and Effective Rent as a way to measure flexible rents - a combination of concessions, free rent and rent bumps agreed upon by the tenant and the landlord. To do so we use the tidyverse framework and break our queries into a sequence of steps that visualize different features of the data. The main challenge in this analysis is incomplete data, which requires an analytical strategy to work around the problem.
library(tidyverse)
library(psych)
library(kableExtra)
library(knitr)
library(scales)
library(reshape2)
mydata_01 <- read.csv("mydata_01.csv", na.strings=c(""," ","NA"))
Rent_var <- mydata_01 %>%
select(
Free.Rent.Type,
Break.Option.Dates,
Break.Option.Type,
Rent.Bump.Year,
Rent.Bump.Dollar..USD.,
Work.Value..USD.,
Work.Type,
Renewal.Options
)
# VIM library for using 'aggr'
library(VIM)
# 'aggr' plots the amount of missing/imputed values in each column
a <- aggr(Rent_var, prop = T, numbers = F, combined = F,
          labels = names(Rent_var), cex.axis = .7, oma = c(8,4,4,3))
summary(a)
##
## Missings per variable:
## Variable Count
## Free.Rent.Type 5228
## Break.Option.Dates 5228
## Break.Option.Type 5228
## Rent.Bump.Year 5226
## Rent.Bump.Dollar..USD. 5196
## Work.Value..USD. 4945
## Work.Type 4859
## Renewal.Options 4877
##
## Missings in combinations of variables:
## Combinations Count Percent
## 1:1:1:0:1:1:1:0 1 0.01912777
## 1:1:1:0:1:1:1:1 1 0.01912777
## 1:1:1:1:0:0:0:0 1 0.01912777
## 1:1:1:1:0:0:0:1 5 0.09563887
## 1:1:1:1:0:0:1:1 1 0.01912777
## 1:1:1:1:0:1:0:1 3 0.05738332
## 1:1:1:1:0:1:1:0 4 0.07651109
## 1:1:1:1:0:1:1:1 18 0.34429992
## 1:1:1:1:1:0:0:0 34 0.65034430
## 1:1:1:1:1:0:0:1 216 4.13159908
## 1:1:1:1:1:0:1:0 1 0.01912777
## 1:1:1:1:1:0:1:1 25 0.47819434
## 1:1:1:1:1:1:0:0 14 0.26778883
## 1:1:1:1:1:1:0:1 96 1.83626626
## 1:1:1:1:1:1:1:0 296 5.66182096
## 1:1:1:1:1:1:1:1 4512 86.30451415
Note that most of these variables are missing, so gaining insights on rent concessions and bumps directly would be very challenging. We can take a different approach and examine the difference between the effective and the current rent - this difference can serve as a proxy for the traditional concession analysis.
# Top flexible spaces by tenants
Rents_Analysis <- mydata_01 %>%
select(
Street.Address,
Submarket,
Tenant.Name,
Property.Type,
Tenant.Industry,
Effective_Rent,
Current_Rent,
Total.Transaction.Size,
Transaction.SQFT,
Expiration.Date,
Lease.Type,
Lease.Term) %>%
arrange(desc(Current_Rent))
head(Rents_Analysis)
The CompStak Enterprise Data Dictionary contains the definitions of these variables:
+ Current Rent (PSF): Rent per square foot at present time.
+ Net Effective Rent (PSF): Actual amount of rent paid (subtract lease concessions from starting rent).
CompStak calculates Effective Rents by netting lease concessions (such as free rent) out of the starting rent over the lease term. More information on the Effective Rent calculation, and the accompanying CompStak Excel, can be found on the CompStak website.
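As a rough illustration only - this is not CompStak's exact methodology - a simplified net effective rent can be sketched in R as the starting rent with free-rent months netted out and averaged back over the lease term. The function name and inputs below are hypothetical:
# A minimal sketch of a simplified net effective rent (PSF).
# Hypothetical inputs - NOT CompStak's exact methodology; see their documentation.
simple_effective_rent <- function(starting_rent_psf, free_rent_months, lease_term_months) {
  paying_months <- lease_term_months - free_rent_months
  # Rent actually paid over the lease, averaged back over the full term
  starting_rent_psf * paying_months / lease_term_months
}
# Example: $100 PSF starting rent, 6 months free on a 10-year (120-month) lease
simple_effective_rent(100, 6, 120)  # ~95 PSF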
The mutate function creates two new variables: Flex_Rent, the difference between the effective and current rents, which serves as a proxy for the implied lease concessions, benefits and bumps that are almost completely missing from the original database, and Space_Ratio, the difference between the total transaction size and the transaction square footage.
Rents_Analysis <- Rents_Analysis %>%
# Create two new variables
mutate(Flex_Rent = Effective_Rent - Current_Rent,
Space_Ratio = Total.Transaction.Size - Transaction.SQFT)
Rents_Analysis %>%
select(Tenant.Name, Submarket, Effective_Rent, Flex_Rent, Space_Ratio) %>%
arrange(desc(Flex_Rent))
#library(reshape2)
# Melt the two rent columns by name
combined_graph = melt(Rents_Analysis[, c("Current_Rent", "Effective_Rent")])
ggplot(data = combined_graph) +
coord_cartesian(xlim = c( 0, 500)) +
ggtitle("Effective and Curren Rents unber $500") +
theme_bw() + xlab("Rent price") + ylab("Number of properties") +
geom_histogram( aes(x = value, y=(..count..)/sum(..count..), fill=variable),
alpha=1, binwidth=10, position="identity")
Chart_01 <- Rents_Analysis %>%
filter(Flex_Rent > 150)
d <- ggplot(Chart_01,
aes(x = Tenant.Name, y = Flex_Rent, fill = Current_Rent))
d + ggtitle("Top flexible leases - colored by Current_Rent") +
geom_bar(stat="identity") +
xlab("Tenant Name") + ylab("Flex_Rent") +
theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 9)) +
scale_fill_gradient(low="lightblue",high="orange")
Here we see that Bulgari, located on 5th Avenue, has the most flexible lease, closely followed by Chanel. In addition, it seems like Microsoft is getting a good deal - substantial benefits with an average effective rent.
Rents_Analysis %>%
filter(Tenant.Name == "Bulgari")
## Transaction SQFT vs Total Transaction Size
The official definitions from the glossary are:
+ Transaction Size (SF): Amount of square feet leased by tenant.
+ Total Transaction Size (SF): Total amount of space leased by tenant in the transaction, in square feet. Used when the transaction size does not reflect the entirety of the space leased, or when multiple spaces are rented at different rates.
Let’s plot the data -
#library(reshape2)
# Melt the two transaction-size columns by name
combined_graph = melt(Rents_Analysis[, c("Total.Transaction.Size", "Transaction.SQFT")])
ggplot(data = combined_graph) +
coord_cartesian(xlim = c( 0, 5000)) +
ggtitle("Transaction size + total transaction size") +
theme_bw() +
xlab("Size of Transaction") +
ylab("Number of properties") +
geom_histogram( aes(x = value, y=(..count..)/sum(..count..), fill=variable),
alpha=1, binwidth= 100, position="identity")
Looks like there are some big differences between the two variables -
Transaction_Ratio <- Rents_Analysis %>%
select(
Tenant.Name,
Tenant.Industry,
Effective_Rent,
Current_Rent,
Space_Ratio,
Lease.Term ) %>%
arrange(desc(Space_Ratio))
head(Transaction_Ratio)
Chart_01 <- Transaction_Ratio %>%
filter(Space_Ratio > 25000)
d <- ggplot(Chart_01,
aes(x = Tenant.Name, y = Space_Ratio, fill = Current_Rent))
d + ggtitle("Top 20 transaction size ratios - color by Current rent") +
geom_bar(stat="identity") +
xlab("Tenant Name") + ylab("Space_Ratio = Total.Transaction.Size - Transaction.SQFT") +
theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 9)) +
scale_fill_gradient(low="lightblue",high="orange")
Nike has the highest difference, followed by Fox and Volvo - makes sense
Rents_Analysis %>%
filter(Tenant.Name == "Nike" )
Chart_01 <- Rents_Analysis %>%
filter(Flex_Rent > 150)
d <- ggplot(Chart_01,
aes(x = Tenant.Name, y = Flex_Rent, fill = Space_Ratio))
d + ggtitle("Top 20 Flexible leases - color by difference in transaction size") +
geom_bar(stat="identity") +
xlab("Tenant Name") + ylab("Flex_Rent (current - effective rents)") +
theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 9)) +
scale_fill_gradient(low="lightblue",high="orange")
Rents_Analysis %>%
filter(Tenant.Name == "H&M" )
# Lease term analysis
head(mydata_01$Lease.Term)
## [1] 10 years 10 years <NA> <NA> 10 years <NA>
## 136 Levels: 1 months 1 years 1 years,2 months ... 9 years,9 months
The goal is to reformat these descriptions into simple numeric variables
Lease_Analysis <- Rents_Analysis %>%
separate(Lease.Term, c("Year", "t1", "Month", "t2"))
# Change variable to numeric
Lease_Analysis [12:17] <- sapply(Lease_Analysis[12:17], as.numeric)
# Transform year into months
Lease_Analysis <- Lease_Analysis %>%
mutate(YearMonth = Year * 12) %>%
mutate(Lease_by_month = YearMonth + Month) %>%
select( -Year, -t1, -Month, -t2 )
# Pull all results into Lease_by_month variable
Lease_Analysis$Lease_by_month[is.na(Lease_Analysis$Lease_by_month)] <- Lease_Analysis$YearMonth[is.na(Lease_Analysis$Lease_by_month)]
lease_time <- Rents_Analysis %>%
select(Lease.Term)
Lease_Analysis <- cbind(Lease_Analysis, lease_time) %>%
mutate( Lease_by_year = Lease_by_month / 12)
Lease_Analysis %>%
select(Lease.Term, YearMonth, Lease_by_month, Lease_by_year)
Lease_Analysis$Lease_by_year <- as.numeric(Lease_Analysis$Lease_by_year)
His_01 <- ggplot(Lease_Analysis,
aes(x = Lease_by_year)) +
# Limit X axis to 20 years
coord_cartesian(xlim = c( 0, 20)) +
# Set histogram params
geom_histogram(binwidth= 1, aes(y=..density..,fill=Property.Type)) +
ggtitle("CompStak adjusted lease durations for 5,528 properties") +
scale_fill_brewer( palette="Spectral" , na.value="gray") + theme_bw()
His_01
# Using a clean set
lease_clean <- na.omit(Lease_Analysis)
Chart_01 <- lease_clean
d <- ggplot(Chart_01,
aes(x = Tenant.Industry, y = Lease_by_year, fill = Tenant.Industry))
d + ggtitle("Leases breakdown by industry (1,369 properties)") +
geom_bar(stat="identity") +
xlab("Tenant Industry") + ylab("Lease_by_year") +
theme_bw() + theme(legend.position="none", axis.text.x = element_text(angle = 45, hjust = 1, size = 9))
Chart_01 <- lease_clean %>%
filter(Lease_by_year == 10)
bp<- ggplot(Chart_01, aes(x=Lease_by_year, y="", fill= Tenant.Industry))+
geom_bar(width = 1, stat = "identity")
pie <- bp + coord_polar("y", start=0) + theme_bw() +
ggtitle(" Leases breakdown by industry (1,369 properties)") +
# Cleaning up graphics
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.text.x=element_blank(),
axis.title.x=element_blank(),
axis.ticks.y=element_blank(),
legend.text = element_text(color = "black", size = 5)
)
pie
Piping new variables - a workflow with mutate, group_by and summarise.
Lease_Descriptive <- Lease_Analysis %>%
select( - Tenant.Name,
- Tenant.Industry,
- Property.Type,
- Lease_by_month,
- Lease.Term,
- Lease.Type,
- YearMonth,
- Expiration.Date,
- Submarket)
describe(Lease_Descriptive)
Note -
Left_Tail <- Lease_Analysis %>%
filter (Current_Rent < 500,
Transaction.SQFT < 500)
d <- ggplot(Left_Tail,
aes((Transaction.SQFT), (Current_Rent), color = Submarket, size = Space_Ratio))
d + ggtitle("Small leases - 316 Properties in sample") +
scale_x_continuous(name = "Total Transaction Size (sqf)") +
scale_y_continuous(name = "Transaction amount") +
scale_colour_viridis_d(option = "spectral", na.value="pink") + theme_bw() +
theme( legend.text = element_text(color = "black", size = 5), axis.text.x = element_text(hjust = 1, size = 6)) +
geom_point()
Left_Tail %>%
select(Tenant.Name, Tenant.Industry, Submarket, Effective_Rent, Total.Transaction.Size, Lease_by_year ,Flex_Rent, Space_Ratio) %>%
arrange(desc(Effective_Rent))
Right_Tail <- Lease_Analysis %>%
filter(
Current_Rent > 1000,
Transaction.SQFT > 1000 )
d <- ggplot(Right_Tail,
aes((Current_Rent), (Transaction.SQFT), color = Tenant.Industry, size = Flex_Rent))
d + ggtitle("Large leases - 55 Properties") +
scale_x_continuous(name = "Transaction Size (Sqf)") +
scale_y_continuous(name = "Transaction amount") +
scale_colour_viridis_d(option = "spectral", na.value="red") + theme_bw() +
theme( legend.text = element_text(color = "black", size = 5), axis.text.x = element_text(hjust = 1, size = 6)) +
geom_point()
Right_Tail %>%
select(Tenant.Name, Tenant.Industry, Submarket, Effective_Rent, Total.Transaction.Size, Lease_by_year ,Flex_Rent, Space_Ratio) %>%
arrange(desc(Effective_Rent))
Note - fifth Ave.
### The middle of the distribution
Mid_dis <- Lease_Analysis %>%
filter(between(Current_Rent, 100, 1000),
between(Transaction.SQFT, 100, 5000))
d <- ggplot(Mid_dis,
aes((Transaction.SQFT), (Current_Rent), color = Submarket))
d + ggtitle("Transaction size over Effective rent - 2,594 Properties") +
scale_x_continuous(name = "Transaction Size (Sqf)") +
scale_y_continuous(name = "Transaction amount") +
scale_colour_viridis_d(option = "spectral", na.value="red") + theme_bw() +
theme( legend.text = element_text(color = "black", size = 5), axis.text.x = element_text(hjust = 1, size = 6)) +
geom_point()
Mid_dis %>%
select(Tenant.Name, Tenant.Industry, Submarket, Effective_Rent, Total.Transaction.Size, Lease_by_year ,Flex_Rent, Space_Ratio) %>%
arrange(desc(Effective_Rent))
# Descriptive statistics for the middle of the distribution
Lease_Analysis %>%
  filter(between(Current_Rent, 100, 1000),
         between(Transaction.SQFT, 100, 5000)) %>%
  describe()
Chart_01 <- Lease_Analysis %>%
filter( Flex_Rent > 30, between(Transaction.SQFT, 100, 5000))
d <- ggplot(Chart_01,
aes(log(Current_Rent), (Transaction.SQFT), color = Submarket, size = Space_Ratio))
d + ggtitle("Small leases - 199 Properties in sample") +
# scale_x_continuous(name = "Total Transaction Size (sqf)") +
# scale_y_continuous(name = "Transaction amount") +
scale_colour_viridis_d(na.value = "pink") + theme_bw() +
theme( legend.text = element_text(color = "black", size = 5), axis.text.x = element_text(hjust = 1, size = 9)) +
geom_point()
NOTE - IS THIS THE RIGHT GRAPH??
# K-means Cluster Analysis
This part of the notebook serves as an introduction to the k-means clustering method. Clustering is a broad set of techniques for finding subgroups of observations within a data set. When we cluster observations, we want observations in the same group to be similar and observations in different groups to be dissimilar. Because there isn’t a response variable, this is an unsupervised method, which implies that it seeks to find relationships between the n observations without being trained by a response variable. Clustering allows us to identify which observations are alike, and potentially categorize them therein. K-means clustering is the simplest and the most commonly used clustering method for splitting a dataset into a set of k groups.
library(tidyverse) # data manipulation
library(cluster) # clustering algorithms
library(factoextra) # clustering algorithms & visualization
To remove any missing values that might be present in the data, we can use na.omit. A different (and possibly better) idea would be to impute at least some of the missing observations.
Clean_List <- na.omit(Lease_Analysis)
Clean_List
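As a sketch of the imputation idea (not used in the rest of this notebook), numeric gaps could be filled with, for example, each variable's median:
# A minimal sketch of median imputation for the numeric columns only;
# categorical variables would need a different strategy.
Lease_Imputed <- Lease_Analysis %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))
# Compare how many rows each approach keeps
nrow(na.omit(Lease_Analysis))
nrow(Lease_Imputed)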
To perform a cluster analysis in R, the data should generally be prepared as follows: the rows are the observations to cluster (here, the grouped summaries we build below), missing values are removed or imputed, and the variables are scaled so that they are comparable.
By_Tenant.Industry <- Clean_List %>%
# Grouping by Tenant industry -->
group_by(Tenant.Industry) %>%
dplyr::summarise(
Mean_eff_rent = mean(Effective_Rent),
# Mean_Space_ratio = mean(Space_Ratio),
# Mean_flex_rent = mean(Flex_Rent),
Lease = mean(Lease_by_year),
Mean_tran_sqf = mean(Transaction.SQFT)) %>%
arrange(desc(Mean_eff_rent))
By_Tenant.Industry
IMPORTANT: keep the group labels by turning the Tenant.Industry column into row names.
By_Tenant.Industry <- By_Tenant.Industry %>%
remove_rownames %>%
column_to_rownames(var="Tenant.Industry")
Take a look -
By_Tenant.Industry
As we don't want the clustering algorithm to depend on arbitrary variable units, we start by scaling/standardizing the data using the R function scale:
By_Tenant.Industry_scale <- scale(By_Tenant.Industry)
head(By_Tenant.Industry_scale)
## Mean_eff_rent Lease Mean_tran_sqf
## Software & Information 3.8215541 2.51132559 -0.063034313
## Apparel 0.2658731 0.02837450 -0.001967983
## Consumer Durables 0.1080052 -0.41702644 -0.632198139
## Banks -0.1652521 0.01808247 -0.053494492
## Retail -0.1758100 0.49121545 -0.087929836
## Leisure & Restaurants -0.2554199 0.09613033 -0.683210840
The classification of observations into groups requires some method for computing the distance or the (dis)similarity between each pair of observations. The result of this computation is known as a dissimilarity or distance matrix. There are many methods to calculate this distance information, and the choice of distance measure is a critical step in clustering: it defines how the similarity of two elements (x, y) is calculated and it will influence the shape of the clusters. The classical distance measures are the Euclidean and Manhattan distances, defined as follows:
+ Euclidean distance:
\[ d_{euc}(x,y) = \sqrt{\sum^n_{i=1}(x_i - y_i)^2} \tag{1} \]
+ Manhattan distance:
\[ d_{man}(x,y) = \sum^n_{i=1}|(x_i - y_i)| \tag{2} \]
Where x and y are two vectors of length n.
Other dissimilarity measures exist such as correlation-based distances, which is widely used for gene expression data analyses. Correlation-based distance is defined by subtracting the correlation coefficient from 1. Different types of correlation methods can be used such as:
+ Pearson correlation distance:
\[ d_{cor}(x, y) = 1 - \frac{\sum^n_{i=1}(x_i-\bar x)(y_i - \bar y)}{\sqrt{\sum^n_{i=1}(x_i-\bar x)^2\sum^n_{i=1}(y_i - \bar y)^2}} \tag{3} \]
+ Spearman correlation distance:
\[ d_{spear}(x, y) = 1 - \frac{\sum^n_{i=1}(x^\prime_i-\bar x^\prime)(y^\prime_i - \bar y^\prime)}{\sqrt{\sum^n_{i=1}(x^\prime_i-\bar x^\prime)^2\sum^n_{i=1}(y^\prime_i - \bar y^\prime)^2}} \tag{4} \]
The Spearman correlation method computes the correlation between the ranks of the x and y variables.
The Kendall correlation method measures the correspondence between the rankings of the x and y variables. The total number of possible pairings of x with y observations is n(n - 1)/2, where n is the size of x and y. Begin by ordering the pairs by the x values. If x and y are correlated, they will have the same relative rank orders. Then, for each y_i, count the number of y_j > y_i (concordant pairs, c) and the number of y_j < y_i (discordant pairs, d). The Kendall correlation distance is then:
\[ d_{kend}(x,y) = 1 - \frac{n_c - n_d}{\frac{1}{2}n(n - 1)} \tag{5} \]
NOTE - The choice of distance measures is very important, as it has a strong influence on the clustering results. For most common clustering software, the default distance measure is the Euclidean distance. However, depending on the type of the data and the research questions, other dissimilarity measures might be preferred and you should be aware of the options.
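To make the choice concrete, the short sketch below compares Euclidean and Manhattan distance matrices for the scaled industry table using base R's dist (get_dist, introduced next, wraps the same idea):
# Euclidean vs Manhattan distances (Eq. 1 and 2) on the scaled industry table
d_euc <- dist(By_Tenant.Industry_scale, method = "euclidean")
d_man <- dist(By_Tenant.Industry_scale, method = "manhattan")
# The two measures broadly agree on which industries are far apart,
# but the absolute values differ
round(as.matrix(d_euc)[1:3, 1:3], 2)
round(as.matrix(d_man)[1:3, 1:3], 2)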
Within R it is simple to compute and visualize the distance matrix using the functions get_dist and fviz_dist from the factoextra R package. The resulting heat map illustrates which tenant industries have large dissimilarities (red) versus those that appear to be fairly similar (teal).
get_dist: for computing a distance matrix between the rows of a data matrix. The default distance computed is the Euclidean; however, get_dist also supports the distances described in equations 2-5 above, plus others.
fviz_dist: for visualizing a distance matrix
distance <- get_dist(By_Tenant.Industry_scale)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
# Red cells indicate large dissimilarities between industries; teal cells indicate similar industries
By_Submarket <- Clean_List %>%
# Grouping by Submarket -->
group_by(Submarket) %>%
dplyr::summarise(
Mean_eff_rent = mean(Effective_Rent),
# Mean_flex_rent = mean(Flex_Rent),
Lease = mean(Lease_by_year),
Mean_tran_sqf = mean(Transaction.SQFT)) %>%
arrange(desc(Mean_eff_rent))
By_Submarket
By_Submarket <- By_Submarket %>%
remove_rownames %>%
column_to_rownames(var="Submarket")
By_Submarket
By_Submarket_scale <- scale(By_Submarket)
head(By_Submarket_scale)
## Mean_eff_rent Lease Mean_tran_sqf
## Madison/Fifth Avenue 4.70273406 0.8256775 0.00198382
## Penn Station 0.34778694 1.1765811 -0.21979335
## Park Avenue 0.22334108 -0.3037934 -0.49473198
## Sixth Avenue 0.11906771 1.2382633 -0.83000406
## Times Square South -0.04406815 -0.1853634 -0.22877683
## Columbus Circle -0.08163094 -0.3037934 -1.17769773
distance <- get_dist(By_Submarket_scale)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
This heat map illustrates which submarkets have large dissimilarities (red) versus those that appear to be fairly similar (teal).
## K-Means Clustering
The basic idea behind k-means clustering consists of defining clusters so that the total intra-cluster variation (known as total within-cluster variation) is minimized. There are several k-means algorithms available. The standard algorithm is the Hartigan-Wong algorithm (1979), which defines the total within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid:
\[ W(C_k) = \sum_{x_i \in C_k}(x_i - \mu_k)^2 \tag{6} \]
Where x_i is a data point belonging to cluster C_k and μ_k is the mean value of the points assigned to cluster C_k. Each observation x_i is assigned to a cluster such that the sum-of-squares (SS) distance of the observation to its assigned cluster center μ_k is minimized.
The first step when using k-means clustering is to indicate the number of clusters (k) that will be generated in the final solution. The algorithm starts by randomly selecting k objects from the data set to serve as the initial centers for the clusters. The selected objects are also known as cluster means or centroids. Next, each of the remaining objects is assigned to its closest centroid, where closest is defined using the Euclidean distance between the object and the cluster mean. This step is called the cluster assignment step.
After the assignment step, the algorithm computes the new mean value of each cluster; this step is called the centroid update. Now that the centers have been recalculated, every observation is checked again to see whether it might be closer to a different cluster, and all objects are reassigned using the updated cluster means. The cluster assignment and centroid update steps are repeated iteratively until the cluster assignments stop changing (i.e. until convergence is achieved) - that is, until the clusters formed in the current iteration are the same as those obtained in the previous iteration.
The k-means algorithm can be summarized as follows:
1. Specify the number of clusters (K) to be created (by the analyst).
2. Randomly select k objects from the data set as the initial cluster centers or means.
3. Assign each observation to its closest centroid, based on the Euclidean distance between the object and the centroid.
4. For each of the k clusters, update the cluster centroid by calculating the new mean values of all the data points in the cluster. The centroid of the kth cluster is a vector of length p containing the means of all variables for the observations in that cluster, where p is the number of variables.
5. Iteratively minimize the total within sum of squares (Eq. 7): repeat steps 3 and 4 until the cluster assignments stop changing or the maximum number of iterations is reached. By default, R uses 10 as the maximum number of iterations.
### Computing k-means clustering in R
We can compute k-means in R with the kmeans function. Here we will group the data into three clusters (centers = 3). The kmeans function also has an nstart option that attempts multiple initial configurations and reports on the best one. For example, adding nstart = 25 will generate 25 initial configurations. This approach is often recommended.
k2 <- kmeans(By_Tenant.Industry, centers = 3, nstart = 25)
str(k2)
## List of 9
## $ cluster : Named int [1:17] 2 3 3 3 3 3 3 3 3 1 ...
## ..- attr(*, "names")= chr [1:17] "Software & Information" "Apparel" "Consumer Durables" "Banks" ...
## $ centers : num [1:3, 1:3] 81 4471.9 224.6 13.3 20 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "1" "2" "3"
## .. ..$ : chr [1:3] "Mean_eff_rent" "Lease" "Mean_tran_sqf"
## $ totss : num 2.51e+08
## $ withinss : num [1:3] 16869250 0 15561387
## $ tot.withinss: num 32430638
## $ betweenss : num 2.19e+08
## $ size : int [1:3] 3 1 13
## $ iter : int 2
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
The output of kmeans is a list with several bits of information. The most important being:
+ cluster: a vector of integers (from 1:k) indicating the cluster to which each point is allocated.
+ centers: a matrix of cluster centers.
+ totss: the total sum of squares.
+ withinss: vector of within-cluster sum of squares, one component per cluster.
+ tot.withinss: total within-cluster sum of squares, i.e. sum(withinss).
+ betweenss: the between-cluster sum of squares, i.e. totss - tot.withinss.
+ size: the number of points in each cluster.
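As a quick aside (not part of the original walkthrough), these components can be pulled out directly, and tot.withinss can be checked by hand against Eq. 6:
# Pull out individual components of the kmeans object
k2$size          # number of observations per cluster
k2$centers       # cluster centroids
k2$tot.withinss  # total within-cluster sum of squares
# Hand check of Eq. 6: sum of squared distances to each assigned centroid
sum((as.matrix(By_Tenant.Industry) - k2$centers[k2$cluster, ])^2)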
If we print the results we see that the grouping produced three clusters of sizes 3, 1 and 13. We also see the cluster centers (means) for the three groups across the three variables (Mean_eff_rent, Lease, Mean_tran_sqf), and the cluster assignment for each observation (e.g. Software & Information is assigned to cluster 2, Apparel to cluster 3, and so on).
k2
## K-means clustering with 3 clusters of sizes 3, 1, 13
##
## Cluster means:
## Mean_eff_rent Lease Mean_tran_sqf
## 1 81.0500 13.333333 12145.667
## 2 4471.9200 20.000000 4500.000
## 3 224.6072 9.025285 3050.603
##
## Clustering vector:
## Software & Information
## 2
## Apparel
## 3
## Consumer Durables
## 3
## Banks
## 3
## Retail
## 3
## Leisure & Restaurants
## 3
## Food & Beverage
## 3
## Media
## 3
## Legal Services
## 3
## Automobiles & Components
## 1
## Education
## 3
## Non-Profit
## 3
## Financial Services
## 3
## Transportation, Warehousing & Storage
## 3
## Real Estate
## 1
## Health Care Equipment & Services
## 3
## Public Institutions
## 1
##
## Within cluster sum of squares by cluster:
## [1] 16869250 0 15561387
## (between_SS / total_SS = 87.1 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
We can also view our results by using fviz_cluster. This provides a nice illustration of the clusters. If there are more than two dimensions (variables) fviz_cluster will perform principal component analysis (PCA) and plot the data points according to the first two principal components that explain the majority of the variance.
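As a small aside, prcomp can show how much variance the first two principal components of the scaled industry table capture - roughly the share of information the two plot axes will carry:
# How much variance do the first two principal components explain?
pca <- prcomp(By_Tenant.Industry_scale)
summary(pca)$importance["Cumulative Proportion", 1:2]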
fviz_cluster(k2, data = By_Tenant.Industry, labelsize = 9) + theme_bw()
k2 <- kmeans(By_Tenant.Industry_scale, centers = 3, nstart = 25)
str(k2)
## List of 9
## $ cluster : Named int [1:17] 2 3 3 3 3 3 3 3 3 1 ...
## ..- attr(*, "names")= chr [1:17] "Software & Information" "Apparel" "Consumer Durables" "Banks" ...
## $ centers : num [1:3, 1:3] -0.35 3.822 -0.213 0.762 2.511 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "1" "2" "3"
## .. ..$ : chr [1:3] "Mean_eff_rent" "Lease" "Mean_tran_sqf"
## $ totss : num 48
## $ withinss : num [1:3] 2.31 0 6.5
## $ tot.withinss: num 8.81
## $ betweenss : num 39.2
## $ size : int [1:3] 3 1 13
## $ iter : int 2
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
fviz_cluster(k2, data = By_Tenant.Industry_scale, labelsize = 9) + theme_bw()
By_Tenant.Industry_scale
## Mean_eff_rent Lease
## Software & Information 3.8215541 2.51132559
## Apparel 0.2658731 0.02837450
## Consumer Durables 0.1080052 -0.41702644
## Banks -0.1652521 0.01808247
## Retail -0.1758100 0.49121545
## Leisure & Restaurants -0.2554199 0.09613033
## Food & Beverage -0.2702006 0.29510949
## Media -0.2921693 -1.87372042
## Legal Services -0.2971250 -0.55055192
## Automobiles & Components -0.3053644 1.19909237
## Education -0.3255679 -0.11314085
## Non-Profit -0.3258339 -0.63803414
## Financial Services -0.3324434 -1.06450993
## Transportation, Warehousing & Storage -0.3406812 -0.11314085
## Real Estate -0.3509726 -0.11314085
## Health Care Equipment & Services -0.3659727 -0.95515717
## Public Institutions -0.3926194 1.19909237
## Mean_tran_sqf
## Software & Information -0.063034313
## Apparel -0.001967983
## Consumer Durables -0.632198139
## Banks -0.053494492
## Retail -0.087929836
## Leisure & Restaurants -0.683210840
## Food & Beverage -0.677571659
## Media -0.414746815
## Legal Services -0.917891090
## Automobiles & Components 1.376265362
## Education -0.424363954
## Non-Profit -0.637249461
## Financial Services -0.636791502
## Transportation, Warehousing & Storage -0.112493884
## Real Estate 1.634554231
## Health Care Equipment & Services -0.470356121
## Public Institutions 2.802480495
## attr(,"scaled:center")
## Mean_eff_rent Lease Mean_tran_sqf
## 449.1155 10.4311 4740.8732
## attr(,"scaled:scale")
## Mean_eff_rent Lease Mean_tran_sqf
## 1052.661940 3.810298 3821.302884
Because the number of clusters (k) must be set before we start the algorithm, it is often advantageous to examine several different values of k and compare the results. Below we execute the same process for 2, 3, 4 and 5 clusters; the results are shown in the figure:
k2 <- kmeans(By_Tenant.Industry_scale, centers = 2, nstart = 25)
k3 <- kmeans(By_Tenant.Industry_scale, centers = 3, nstart = 25)
k4 <- kmeans(By_Tenant.Industry_scale, centers = 4, nstart = 25)
k5 <- kmeans(By_Tenant.Industry_scale, centers = 5, nstart = 25)
# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = By_Tenant.Industry_scale) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = By_Tenant.Industry_scale) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = By_Tenant.Industry_scale) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = By_Tenant.Industry_scale) + ggtitle("k = 5")
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
grid.arrange(p1, p2, p3, p4, nrow = 2)
Although this visual assessment tells us where true delineations occur (or do not occur such as clusters 2 & 4 in the k = 5 graph) between clusters, it does not tell us what the optimal number of clusters is.
## Determining Optimal Clusters
As you may recall, the analyst specifies the number of clusters to use; preferably, the analyst would like to use the optimal number of clusters. To aid that choice, the following sections explain the three most popular methods for determining the optimal number of clusters: the Elbow method, the Average Silhouette method and the Gap statistic.
Recall that, the basic idea behind cluster partitioning methods, such as k-means clustering, is to define clusters such that the total intra-cluster variation (known as total within-cluster variation or total within-cluster sum of square) is minimized:
\[ \operatorname{minimize}\Big(\sum^k_{k=1}W(C_k)\Big) \tag{7} \]
Where C_k is the kth cluster and W(C_k) is the within-cluster variation. The total within-cluster sum of squares (wss) measures the compactness of the clustering, and we want it to be as small as possible. Thus, we can compute the total wss for a range of values of k and look for the point at which adding another cluster no longer produces a large improvement - the "elbow".
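A minimal hand-rolled sketch of that computation (the same quantity fviz_nbclust plots below):
# Total within-cluster sum of squares for k = 1..10 - the quantity behind the elbow plot
set.seed(123)
wss <- sapply(1:10, function(k) {
  kmeans(By_Tenant.Industry_scale, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")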
Fortunately, this process to compute the “Elbow method” has been wrapped up in a single function (fviz_nbclust):
set.seed(123)
fviz_nbclust(By_Tenant.Industry_scale, kmeans, method = "wss")
The results suggest that 3 is the optimal number of clusters, as that is where the bend (the "elbow") appears.
In short, the average silhouette approach measures the quality of a clustering: it determines how well each object lies within its cluster, and a high average silhouette width indicates a good clustering. The average silhouette method computes the average silhouette of observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over the range of possible values for k.
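For a single value of k, the average silhouette width can be computed directly with the cluster package (a quick sketch for k = 3):
# Average silhouette width for a k = 3 solution on the scaled industry table
set.seed(123)
km3 <- kmeans(By_Tenant.Industry_scale, centers = 3, nstart = 25)
sil <- silhouette(km3$cluster, dist(By_Tenant.Industry_scale))
mean(sil[, "sil_width"])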
Similar to the elbow method, this process to compute the “average silhouette method” has been wrapped up in a single function (fviz_nbclust):
fviz_nbclust(By_Tenant.Industry_scale, kmeans, method = "silhouette")
The results show that 3 clusters maximize the average silhouette values with 2 clusters coming in as second optimal number of clusters.
The gap statistic was published by R. Tibshirani, G. Walther and T. Hastie (Stanford University, 2001). The approach can be applied to any clustering method (e.g. k-means clustering, hierarchical clustering). The gap statistic compares the total intra-cluster variation for different values of k with their expected values under a null reference distribution of the data (i.e. a distribution with no obvious clustering). The reference data set is generated using Monte Carlo simulations of the sampling process: for each variable x_i in the data set we compute its range [min(x_i), max(x_i)] and generate values for the n points uniformly from that interval. For both the observed data and the reference data, the total intra-cluster variation is computed for different values of k. The gap statistic for a given k is defined as follows:
\[ Gap_n(k) = E^*_n\{\log(W_k)\} - \log(W_k) \tag{8} \]
Where E^*_n denotes the expectation under a sample of size n from the reference distribution. E^*_n is defined via bootstrapping (B) by generating B copies of the reference data set and computing the average log(W*_k). The gap statistic measures the deviation of the observed W_k value from its expected value under the null hypothesis. The estimate of the optimal number of clusters is the value of k that maximizes Gap_n(k); this means that the clustering structure is far away from a uniform distribution of points.
To compute the gap statistic we can use the clusGap function, which returns the gap statistic and its standard error for each number of clusters.
# compute gap statistic
set.seed(123)
gap_stat <- clusGap(By_Tenant.Industry_scale, FUN = kmeans, nstart = 25,
K.max = 10, B = 50)
# Print the result
print(gap_stat, method = "firstmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = By_Tenant.Industry_scale, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 25)
## B=50 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstmax'): 1
## logW E.logW gap SE.sim
## [1,] 2.0518666 2.4018278 0.3499612 0.10189256
## [2,] 1.7521254 2.0265052 0.2743799 0.10432996
## [3,] 1.2623034 1.7333218 0.4710184 0.09407100
## [4,] 0.9299053 1.4933641 0.5634589 0.09540266
## [5,] 0.7489468 1.2895103 0.5405634 0.10033507
## [6,] 0.5258161 1.0991647 0.5733486 0.10992195
## [7,] 0.3372341 0.9163246 0.5790905 0.12369726
## [8,] 0.1169069 0.7234686 0.6065617 0.13651397
## [9,] -0.2020159 0.5238024 0.7258182 0.14393541
## [10,] -0.3930782 0.2986659 0.6917441 0.16063029
We can visualize the results with fviz_gap_stat, which suggests a single cluster as the optimal number of clusters.
fviz_gap_stat(gap_stat)
### Extracting Results
Although most of these approaches suggest 3 as the optimal number of clusters, we perform the final analysis and extract the results using 4 clusters.
# Compute k-means clustering with k = 4
set.seed(123)
final <- kmeans(By_Tenant.Industry_scale, 4, nstart = 25)
print(final)
## K-means clustering with 4 clusters of sizes 3, 5, 1, 8
##
## Cluster means:
## Mean_eff_rent Lease Mean_tran_sqf
## 1 -0.3496521 0.76168130 1.93776670
## 2 -0.3227088 -1.01639472 -0.61540700
## 3 3.8215541 2.51132559 -0.06303431
## 4 -0.1448817 0.03570051 -0.33415385
##
## Clustering vector:
## Software & Information
## 3
## Apparel
## 4
## Consumer Durables
## 4
## Banks
## 4
## Retail
## 4
## Leisure & Restaurants
## 4
## Food & Beverage
## 4
## Media
## 2
## Legal Services
## 2
## Automobiles & Components
## 1
## Education
## 4
## Non-Profit
## 2
## Financial Services
## 2
## Transportation, Warehousing & Storage
## 4
## Real Estate
## 1
## Health Care Equipment & Services
## 2
## Public Institutions
## 1
##
## Within cluster sum of squares by cluster:
## [1] 2.306732 1.258537 0.000000 1.496645
## (between_SS / total_SS = 89.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
We can visualize the results using fviz_cluster:
fviz_cluster(final, data = By_Tenant.Industry_scale) + theme_bw()
We can extract the clusters and add to our initial data to do some descriptive statistics at the cluster level:
By_Tenant.Industry %>%
mutate(Cluster = final$cluster) %>%
group_by(Cluster) %>%
summarise_all("mean")
### K-means Comments
K-means clustering is a very simple and fast algorithm. Furthermore, it can efficiently deal with very large data sets. However, there are some weaknesses of the k-means approach. One potential disadvantage of K-means clustering is that it requires us to pre-specify the number of clusters. Hierarchical clustering is an alternative approach which does not require that we commit to a particular choice of clusters. Hierarchical clustering has an added advantage over K-means clustering in that it results in an attractive tree-based representation of the observations, called a dendrogram. A future tutorial will illustrate the hierarchical clustering approach. In addition to these commonly used approaches, the NbClust package, published by Charrad et al., 2014, provides 30 indices for determining the relevant number of clusters and proposes to users the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods.
An additional disadvantage of K-means is that it’s sensitive to outliers and different results can occur if you change the ordering of your data. The Partitioning Around Medoids (PAM) clustering approach is less sensitive to outliers and provides a robust alternative to k-means to deal with these situations. A future tutorial will illustrate the PAM clustering approach.
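As a sketch of the PAM alternative mentioned above, using the cluster package that is already loaded (k = 4 chosen here only to match the k-means result):
# Partitioning Around Medoids - a more outlier-robust alternative to k-means
pam_fit <- pam(By_Tenant.Industry_scale, k = 4)
pam_fit$medoids             # the representative (medoid) industry for each cluster
table(pam_fit$clustering)   # cluster sizes
fviz_cluster(pam_fit) + theme_bw()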
## Dendrograms and ggdendrogram
Hierarchical clustering is where you build a cluster tree (a dendrogram) to represent data, where each group (or “node”) links to two or more successor groups. The groups are nested and organized as a tree, which ideally ends up as a meaningful classification scheme.
Each node in the cluster tree contains a group of similar data, and nodes are placed on the graph next to other, similar nodes. Clusters at one level join with clusters in the next level up, using a degree of similarity; the process carries on until all nodes are in the tree, which gives a visual snapshot of the data contained in the whole set. The total number of clusters is not predetermined before you start building the tree.
library("ggplot2")
library("ggdendro")
# Compute distances and hierarchical clustering
dd <- dist(scale(By_Tenant.Industry), method = "euclidean")
hc <- hclust(dd, method = "ward.D2")
ggdendrogram(hc, rotate = TRUE, theme_dendro = FALSE) + theme_bw()
Two different algorithms are found in the literature for Ward clustering. The one used by option “ward.D” (equivalent to the only Ward option “ward” in R versions <= 3.0.3) does not implement Ward’s (1963) clustering criterion, whereas option “ward.D2” implements that criterion (Murtagh and Legendre 2014). With the latter, the dissimilarities are squared before cluster updating. Note that agnes(, method=“ward”) corresponds to hclust(, “ward.D2”).
Ward's minimum variance criterion minimizes the total within-cluster variance. To implement this method, at each step find the pair of clusters that leads to the minimum increase in total within-cluster variance after merging. This increase is a weighted squared distance between cluster centers. At the initial step, all clusters are singletons (clusters containing a single point).
This function performs a hierarchical cluster analysis using a set of dissimilarities for the n objects being clustered. Initially, each object is assigned to its own cluster and then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster. At each stage, distances between clusters are recomputed by the Lance-Williams dissimilarity update formula according to the particular clustering method being used.
Ward's minimum variance method aims at finding compact, spherical clusters. The complete linkage method finds similar clusters. The single linkage method (which is closely related to the minimal spanning tree) adopts a "friends of friends" clustering strategy. The other methods can be regarded as aiming for clusters with characteristics somewhere between the single and complete link methods. Note, however, that the "median" and "centroid" methods do not lead to a monotone distance measure; equivalently, the resulting dendrograms can have so-called inversions or reversals, which are hard to interpret (but note the trichotomies in Legendre and Legendre (2012)).
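Once the tree is built, flat groups can be extracted by cutting it at a chosen number of clusters; a small sketch with cutree, using k = 4 to mirror the k-means result:
# Cut the Ward dendrogram into 4 groups and list the industries in each
groups <- cutree(hc, k = 4)
table(groups)
split(names(groups), groups)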
Hierarchical clustering can easily lead to dendrograms that are just plain wrong. Unless you know your data inside out (pretty much impossible for big data sets), this is largely unavoidable. One of the main reasons is that the clustering algorithm will work even on the most unsuitable data. Another is that the choices you make when building the tree - the distance measure and the linkage method - can lead to significantly different dendrograms; these choices can be tough to make in advance, and you may not be able to tell which of the resulting trees is the most suitable.
End of notebook 02