Clustering & Benchmarking Property

Introduction

Data

Exploratory Analysis

Clustering on Location

Clustering on More Dimensions

Benchmarking

This is a demonstration of visualizing, clustering, and benchmarking data that includes a geographical dimension.

Applying basic analysis to your data can produce novel insights for your business or customers. Where such insights are not already a fundamental expectation, they could give you a competitive advantage or a new value proposition. In the financial and property industries, you might have a portfolio of physical assets such as buildings or houses. Grouping similar properties enables you to benchmark individual properties against their peers. A benchmark report gives owners, investors, and managers a tool for evaluating and planning the performance of their assets.

As a proxy for any ‘real’ portfolio dataset, this example uses data for mountain peaks located in Colorado, USA that have an elevation greater than 14,000 ft (4,270 meters).

Presented first is a simple cluster analysis on location using latitude and longitude. The model is then expanded to consider other factors, such as those describing the hiking experience.

The steps

Load & Familiarize Data
Pick factors and consider cluster sizes
Cluster based on distance
Add other factors to cluster analysis
Visualize data on a map
Use the clusters for benchmarking

The data included in this example is a tidy list of Colorado peaks with sixteen variables describing each peak. Here is a description of each variable:

ID - A unique Identifier for each row
Mountain Peak - The name of the peak
Mountain Range - The name of the primary mountain range the peak is a member of
Elevation_ft - The peak elevation in feet
Fourteener - An indicator (Y or N) of whether the peak is considered a fourteener
Prominence_ft - How much higher the peak is in feet from the next highest point
Isolation_mi - The distance in miles from the nearest point of the same or higher elevation
Lat - The latitudinal coordinate in decimal form
Long - The longitudinal coordinate in decimal form
Standard Route - The name of the most commonly used hiking/climbing route to the peak
Distance_mi - The distance of the standard route in miles
Elevation Gain_ft - The elevation gain of the standard route in feet
Difficulty - The Yosemite Decimal System difficulty rating, a value ranging from Class 1 (easiest) to Class 5 (most difficult)
Traffic Low - The low range of estimated visits in the year 2017
Traffic High - The high range of estimated visits in the year 2017
Photo - A URL to a photo of the peak

Reference the code book for more details about these data elements.

The table below shows the first five rows of the dataset.

Mountain.Peak	Mountain.Range	Elevation_ft	Prominence_ft	Isolation_mi	Lat	Long	Standard.Route	Distance_mi	Elevation.Gain_ft	Difficulty	Traffic.Low	Traffic.High
Mount Elbert	Sawatch Range	14440	9093	670.00	39.1178	-106.4454	Northeast Ridge	9.50	4700	Class 1	20000	25000
Mount Massive	Sawatch Range	14428	1961	5.06	39.1875	-106.4757	East Slopes	14.50	4500	Class 2	7000	10000
Mount Harvard	Sawatch Range	14421	2360	14.93	38.9244	-106.3207	South Slopes	14.00	4600	Class 2	5000	7000
Blanca Peak	Sangre de Cristo Range	14351	5326	103.40	37.5775	-105.4856	Northwest Ridge	17.00	6500	Hard Class 2	1000	3000
La Plata Peak	Sawatch Range	14343	1836	6.28	39.0294	-106.4729	Northwest Ridge	9.25	4500	Class 2	5000	7000

Starting with peak locations, a quick look at the data gives an initial view of the geographic spread.

The isolation histogram already hints at some clustering in the data. Most peaks are within 10 miles of a point of equal or higher elevation. Another group seems to fall between 10 and 30 miles, and a handful are more isolated. Mount Elbert was excluded from this histogram: it is the highest mountain in the data, and the nearest higher peak is Mount Whitney, 670 miles away. Within the scope of ‘Colorado Fourteeners’, this isolation figure is an outlier.

The prominence data indicate a somewhat more even distribution. Mount Elbert is once again a standout, but this time it is included because it is not an outlier. Its prominence of 9,093 ft is nearly double that of any other peak in the data.

The boxplot with jittered points shows the distribution of peak elevations within their respective mountain ranges. The mountain range is a natural clustering mechanism. Studying the jittered points reveals possible sub-groupings within a range, if elevation is an important factor in the data.

We will first conduct a hierarchical cluster analysis based on location in terms of latitude and longitude. Then we will interpret the results by comparing these clusters with the mountain ranges.

First, create a matrix holding the distance between every pair of peaks. For illustrative purposes, the distance is calculated using the Haversine formula and presented in miles.

This distance matrix is too big to display in its entirety, but below is an example of the first five mountains. You can see the distance between each in miles (as the crow flies).

	Mount Elbert	Mount Massive	Mount Harvard	Blanca Peak	La Plata Peak
Mount Elbert	0.0
Mount Massive	5.1	0.0
Mount Harvard	15.0	20.0	0.0
Blanca Peak	118.6	123.6	103.6	0.0
La Plata Peak	6.3	10.9	10.9	113.8	0.0
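As a rough illustration of how such a matrix could be built, the sketch below applies the Haversine formula to the coordinates of the first five peaks from the data table. The function name haversine_mi and the mean Earth radius of 3958.8 miles are assumptions made for this example, not taken from the original analysis, so the results should only be expected to land close to the table above.

```r
# Coordinates of the first five peaks, copied from the data table.
peaks5 <- data.frame(
  peak = c("Mount Elbert", "Mount Massive", "Mount Harvard",
           "Blanca Peak", "La Plata Peak"),
  lat  = c(39.1178, 39.1875, 38.9244, 37.5775, 39.0294),
  long = c(-106.4454, -106.4757, -106.3207, -105.4856, -106.4729)
)

# Great-circle distance in miles between points given in decimal degrees.
# radius_mi is an assumed mean Earth radius (hypothetical default).
haversine_mi <- function(lat1, long1, lat2, long2, radius_mi = 3958.8) {
  to_rad <- pi / 180
  dlat  <- (lat2 - lat1) * to_rad
  dlong <- (long2 - long1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlong / 2)^2
  2 * radius_mi * asin(pmin(1, sqrt(a)))
}

# Evaluate the function for every pair of peaks to get a symmetric matrix.
n <- nrow(peaks5)
dist_mi <- outer(seq_len(n), seq_len(n), function(i, j) {
  haversine_mi(peaks5$lat[i], peaks5$long[i], peaks5$lat[j], peaks5$long[j])
})
dimnames(dist_mi) <- list(peaks5$peak, peaks5$peak)

round(dist_mi, 1)
```

The Haversine formula treats the Earth as a sphere, which is accurate enough for comparing peaks that are tens or hundreds of miles apart.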

The first five peaks are plotted on a map…

Even in these first five peaks, you can see two or three groups appear, depending on how ‘deep’ you look.

Next, we feed the distance matrix into a hierarchical cluster algorithm. You can visualize the results of how the peaks can be grouped with a dendrogram.

In the dendrogram for the first five peaks, the vertical axis lists every mountain peak and the horizontal axis measures the distance between groups. The horizontal axis ranges from zero to more than 120. At zero, every mountain is its own cluster, so there are five clusters. At 120, the data falls into two clusters, and at about 20 there are three clusters.

Here is the dendrogram for all peaks.

You can use the dendrogram to determine how much distance there is between groups, or how many groups your data falls into. Picking a level to cluster on is called cutting the tree.
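A minimal sketch of this step for the five-peak example is shown below. It assumes the distances printed in the table above and the default complete linkage of hclust(); the linkage method actually used for the location clustering is not stated, so treat that as an assumption.

```r
peak_names <- c("Mount Elbert", "Mount Massive", "Mount Harvard",
                "Blanca Peak", "La Plata Peak")

# Pairwise distances in miles, copied from the distance table.
# lower.tri() fills the lower triangle column by column.
m <- matrix(0, 5, 5, dimnames = list(peak_names, peak_names))
m[lower.tri(m)] <- c(5.1, 15.0, 118.6, 6.3,  # distances to Mount Elbert
                     20.0, 123.6, 10.9,      # distances to Mount Massive
                     103.6, 10.9,            # distances to Mount Harvard
                     113.8)                  # Blanca Peak to La Plata Peak

# hclust() expects a dist object; as.dist() reads the lower triangle.
hc <- hclust(as.dist(m))

# Heights at which clusters merge, i.e. the positions on the distance axis.
hc$height

# Draw the tree with peaks on the vertical axis (skipped in batch runs).
if (interactive()) {
  plot(as.dendrogram(hc), horiz = TRUE)
}
```

Under complete linkage, the merge heights mark where the number of clusters drops by one, which is how the cluster counts at different distances can be read off the dendrogram.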

Because the peaks are attributed to one of six mountain ranges, we will cut the tree at a height that gives six groups. The table below compares the hierarchical groups with the mountain ranges.

The groupings based on distance are similar to how the peaks are organized into mountain ranges. Group 1 includes peaks from three different mountain ranges. Group 3 clusters all peaks from the San Juan range, and Group 2 identifies the Sangre de Cristo Range.
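The comparison of groups against mountain ranges can be sketched with cutree() and table(). The example below is only an illustration on the five peaks shown earlier, cut into two groups rather than six, and it again assumes the default complete linkage.

```r
peak_names <- c("Mount Elbert", "Mount Massive", "Mount Harvard",
                "Blanca Peak", "La Plata Peak")
ranges <- c("Sawatch Range", "Sawatch Range", "Sawatch Range",
            "Sangre de Cristo Range", "Sawatch Range")

# Pairwise distances in miles, copied from the distance table.
m <- matrix(0, 5, 5, dimnames = list(peak_names, peak_names))
m[lower.tri(m)] <- c(5.1, 15.0, 118.6, 6.3,
                     20.0, 123.6, 10.9,
                     103.6, 10.9,
                     113.8)
hc <- hclust(as.dist(m))

# Cut the tree so that exactly two groups remain (k = 2 is illustrative).
groups <- cutree(hc, k = 2)

# Cross-tabulate cluster membership against the mountain range labels.
comparison <- table(group = groups, range = ranges)
comparison
```

With the full dataset, the same call with k = 6 would produce the group-by-range table described above.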

The map below shows the peaks where the color represents each group and the labels show the mountain range. Mouse over to see the labels.

That is a simple example of clustering peaks by location. But maybe we want to find groups of similar peaks by including more factors. For example, we might want to group peaks by a hiker’s experience.

The data also includes variables describing the hiking difficulty (on the standard route):

Distance_mi - The distance in miles for the standard route
Elevation.Gain_ft - The elevation gain along the standard route
Difficulty_Rating - A numeric representation of the hike’s class

The table below shows the first five rows.

Mountain.Peak	Distance_mi	Elevation.Gain_ft	Difficulty_Rating
Mount Elbert	9.50	4700	1.0
Mount Massive	14.50	4500	2.0
Mount Harvard	14.00	4600	2.0
Blanca Peak	17.00	6500	2.5
La Plata Peak	9.25	4500	2.0
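How Difficulty_Rating was derived is not shown. One plausible conversion, consistent with the rows above (Class 1 becomes 1.0 and Hard Class 2 becomes 2.5), is sketched below. The function name difficulty_rating and the rule of adding 0.5 for a "Hard" prefix are assumptions inferred from the table, and other prefixes are not handled.

```r
# Convert a Yosemite Decimal System label such as "Hard Class 2" to a number.
# Assumed rule: take the class number and add 0.5 when the label starts
# with "Hard". This is inferred from the table, not from the original code.
difficulty_rating <- function(difficulty) {
  base <- as.numeric(sub(".*Class ([0-9]+).*", "\\1", difficulty))
  base + ifelse(grepl("^Hard", difficulty), 0.5, 0)
}

# Difficulty labels of the first five peaks, copied from the data table.
difficulty <- c("Class 1", "Class 2", "Class 2", "Hard Class 2", "Class 2")
ratings <- difficulty_rating(difficulty)
ratings
```

A numeric rating is needed because distance-based clustering cannot work directly with text labels.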

We will run another cluster analysis, but this time including the hiking difficulty variables along with the location (lat & long). Because the factors are measured using different scales, the data will be normalized before calculating distances.

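# Select the location and hiking columns by position, then standardize each column
scaledFactors <- scale(peaks[,c(8,9,11,12,18)])
# Alternative with a sixth column: scaledFactors <- scale(peaks[,c(8,9,11,12,18,19)])

# Euclidean distances between peaks in the scaled space, returned as a dist object for hclust()
dist_matrix <- dist(scaledFactors)

# Hierarchical clustering (hclust() uses complete linkage by default)
peak_clusters <- hclust(dist_matrix)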
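To see what the normalization does, here is a small self-contained sketch using only the five rows shown in the tables above. The data frame name hike and the choice of k = 2 are illustrative assumptions, and the grouping of five peaks says nothing about the grouping of the full dataset.

```r
# Location and hiking variables for the first five peaks (from the tables).
hike <- data.frame(
  Lat               = c(39.1178, 39.1875, 38.9244, 37.5775, 39.0294),
  Long              = c(-106.4454, -106.4757, -106.3207, -105.4856, -106.4729),
  Distance_mi       = c(9.50, 14.50, 14.00, 17.00, 9.25),
  Elevation.Gain_ft = c(4700, 4500, 4600, 6500, 4500),
  Difficulty_Rating = c(1.0, 2.0, 2.0, 2.5, 2.0),
  row.names = c("Mount Elbert", "Mount Massive", "Mount Harvard",
                "Blanca Peak", "La Plata Peak")
)

# scale() centers each column on 0 and divides by its standard deviation,
# so elevation gain (thousands of feet) no longer dominates latitude
# (fractions of a degree) in the distance calculation.
scaled <- scale(hike)
round(colMeans(scaled), 10)  # each column mean is 0 up to rounding
apply(scaled, 2, sd)         # each column standard deviation is 1

# Cluster on the scaled variables and cut into an illustrative two groups.
hike_clusters <- hclust(dist(scaled))
hike_groups <- cutree(hike_clusters, k = 2)
hike_groups
```

Without scaling, the Euclidean distance would be driven almost entirely by Elevation.Gain_ft, since its raw values are orders of magnitude larger than the others.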

Based on what we know about the data and reviewing the dendrogram, we will cut the tree to produce 8 groups.

Group8	Members
1	16
2	6
3	6
4	11
5	5
6	4
7	1
8	4

The map and boxplot below show how the clustering algorithm grouped similar mountains based on hiker experience factors.

For example, you can see that Group 4 includes the easiest mountains, which also happen to be more accessible to people living in Denver and the Front Range. Group 5 includes the most technically challenging peaks.

Figure panels, one per mountain range: Elk Mountains, Front Range, Mosquito Range, San Juan Mountains, Sangre de Cristo Range, Sawatch Range.

If you have a large dataset, especially one with many dimensions, clustering is a useful step toward pulling out relevant insights. Within the property management context, for example, it can show building managers how their performance compares with other similar facilities. It can also show investors revenue and cost benchmarks for a particular area.

For fourteener hikers, the groupings can narrow the 53 peaks down to a handful that match their interests.

For example, say someone hiked Snowmass Mountain and loved the technical challenge.

That person’s interests might best relate to Group 5. The radar charts show Snowmass and the other two peaks in Group 5 that are in the Elk Mountains.

The table and map below show all the mountains in Group 5.

	Mountain.Peak	Mountain.Range	Standard.Route	Distance_mi	Elevation.Gain_ft	Difficulty	hikeGroup
16	Mount Wilson	San Juan Mountains	North Slopes	16.00	5100	Class 4	5
29	Capitol Peak	Elk Mountains	Northeast Ridge	17.00	5300	Class 4	5
31	Snowmass Mountain	Elk Mountains	East Slopes	22.00	5800	Hard Class 3	5
39	Sunlight Peak	San Juan Mountains	South Face	17.00	6000	Class 4	5
47	Pyramid Peak	Elk Mountains	Northeast Ridge	8.25	4500	Class 4	5
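A benchmark for a single peak against its group can be as simple as comparing its values with the group averages. The sketch below does this for Snowmass Mountain using the Group 5 rows above. The column names in the benchmark table and the choice of the mean as the reference point are assumptions for illustration, and a median or percentile could serve equally well.

```r
# Group 5 peaks and their standard-route figures, copied from the table.
group5 <- data.frame(
  Mountain.Peak     = c("Mount Wilson", "Capitol Peak", "Snowmass Mountain",
                        "Sunlight Peak", "Pyramid Peak"),
  Distance_mi       = c(16.00, 17.00, 22.00, 17.00, 8.25),
  Elevation.Gain_ft = c(5100, 5300, 5800, 6000, 4500)
)

metrics <- c("Distance_mi", "Elevation.Gain_ft")

# Group averages serve as the benchmark for each metric.
group_mean <- colMeans(group5[, metrics])

# Values for the peak being benchmarked.
target <- unlist(
  group5[group5$Mountain.Peak == "Snowmass Mountain", metrics],
  use.names = FALSE
)

# One row per metric: the peak's value, the group mean, and the
# percentage difference between the two.
benchmark <- data.frame(
  metric      = metrics,
  peak_value  = target,
  group_mean  = unname(group_mean),
  pct_vs_mean = round(100 * (target / unname(group_mean) - 1), 1)
)
benchmark
```

On these five rows, the Snowmass standard route comes out about 37% longer than the Group 5 average, with roughly 9% more elevation gain. The same pattern applies to a property portfolio: replace the hiking metrics with cost or revenue figures and the hike group with the property cluster.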