This is a demonstration of visualizing, clustering, and benchmarking data that includes a geographical dimension.
Even basic analysis of your data can produce new insights for your business or customers. Where such insight is not already a baseline expectation, it can give you a competitive advantage or a new value proposition. In the financial and property industries, you might have a portfolio of physical assets such as buildings or houses. Grouping similar properties lets you benchmark each one against its peers. A benchmark report gives owners, investors, and managers a tool for evaluating and planning the performance of their assets.
As a proxy for any ‘real’ portfolio dataset, this example uses data for mountain peaks located in Colorado, USA that have an elevation greater than 14,000 ft (about 4,267 meters).
Presented first is a simple cluster analysis of location using latitude and longitude. The model is then expanded to consider other factors, such as those describing the hiking experience.
The steps
The data included in this example is a tidy list of Colorado peaks with sixteen variables describing each peak. Here is a description of each variable:
Reference the code book for more details about these data elements.
The table below shows the first five rows of the dataset.
Mountain.Peak | Mountain.Range | Elevation_ft | Prominence_ft | Isolation_mi | Lat | Long | Standard.Route | Distance_mi | Elevation.Gain_ft | Difficulty | Traffic.Low | Traffic.High |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Mount Elbert | Sawatch Range | 14440 | 9093 | 670.00 | 39.1178 | -106.4454 | Northeast Ridge | 9.50 | 4700 | Class 1 | 20000 | 25000 |
Mount Massive | Sawatch Range | 14428 | 1961 | 5.06 | 39.1875 | -106.4757 | East Slopes | 14.50 | 4500 | Class 2 | 7000 | 10000 |
Mount Harvard | Sawatch Range | 14421 | 2360 | 14.93 | 38.9244 | -106.3207 | South Slopes | 14.00 | 4600 | Class 2 | 5000 | 7000 |
Blanca Peak | Sangre de Cristo Range | 14351 | 5326 | 103.40 | 37.5775 | -105.4856 | Northwest Ridge | 17.00 | 6500 | Hard Class 2 | 1000 | 3000 |
La Plata Peak | Sawatch Range | 14343 | 1836 | 6.28 | 39.0294 | -106.4729 | Northwest Ridge | 9.25 | 4500 | Class 2 | 5000 | 7000 |
Starting with peak locations, a quick look at the data gives an initial view of the geographic spread.
The isolation histogram already hints at clustering in the data. Most peaks have an isolation of 10 miles or less. Another group falls between 10 and 30 miles, and a handful are more isolated. Mount Elbert was excluded from this histogram because it is the highest peak in the dataset and the nearest higher peak is Mount Whitney, 670 miles away. Within the scope of ‘Colorado Fourteeners’, this isolation figure is an outlier.
The prominence data show a somewhat more even distribution. Mount Elbert is once again a standout, but this time it is included because it is not an outlier. Its prominence of 9,093 ft is nearly double that of any other peak in the data.
The boxplot with jittered points shows the distribution of peak elevations within their respective mountain ranges. The mountain range is a natural grouping. Studying the jittered points reveals possible sub-groupings within a range, if elevation is an important factor in the data.
We will first conduct a hierarchical cluster analysis based on location (latitude and longitude), then interpret the results by comparing the clusters with the mountain ranges.
First, create a matrix holding the distance between every pair of peaks. For illustrative purposes, the distance is calculated using the Haversine formula and presented in miles.
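As a minimal sketch of that step, the base R code below computes Haversine distances for the first five peaks, using the coordinates from the table above. The function name `haversine_mi` and the mean Earth radius of 3958.8 miles are assumptions of this sketch, not details taken from the original analysis, so results may differ slightly from the published figures.

```r
# Great-circle distance in miles between points given in decimal degrees.
# haversine_mi is a hypothetical helper; the Earth radius is an assumed mean value.
haversine_mi <- function(lat1, lon1, lat2, lon2, radius_mi = 3958.8) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_mi * asin(pmin(1, sqrt(a)))
}

# Coordinates of the first five peaks, copied from the table above.
first_five <- data.frame(
  peak = c("Mount Elbert", "Mount Massive", "Mount Harvard",
           "Blanca Peak", "La Plata Peak"),
  lat  = c(39.1178, 39.1875, 38.9244, 37.5775, 39.0294),
  long = c(-106.4454, -106.4757, -106.3207, -105.4856, -106.4729)
)

# outer() evaluates the vectorized function on every pair of row indices.
idx <- seq_len(nrow(first_five))
dist_mi <- outer(idx, idx, function(i, j) {
  haversine_mi(first_five$lat[i], first_five$long[i],
               first_five$lat[j], first_five$long[j])
})
dimnames(dist_mi) <- list(first_five$peak, first_five$peak)

print(round(dist_mi, 1))
```

The result is a full symmetric matrix with zeros on the diagonal; `as.dist(dist_mi)` would convert it to the lower-triangle form that clustering functions expect.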
This distance matrix is too big to display in its entirety, but below is an example for the first five mountains. Each cell shows the distance between a pair of peaks in miles (as the crow flies).
Peak | Mount Elbert | Mount Massive | Mount Harvard | Blanca Peak | La Plata Peak |
---|---|---|---|---|---|
Mount Elbert | 0.0 | | | | |
Mount Massive | 5.1 | 0.0 | | | |
Mount Harvard | 15.0 | 20.0 | 0.0 | | |
Blanca Peak | 118.6 | 123.6 | 103.6 | 0.0 | |
La Plata Peak | 6.3 | 10.9 | 10.9 | 113.8 | 0.0 |
The first five peaks are plotted on a map…
Even in the first five peaks, two or three groups appear, depending on how ‘deep’ you look.
Next, we feed the distance matrix into a hierarchical clustering algorithm. You can visualize how the peaks are grouped with a dendrogram.
The vertical axis lists every mountain peak and the horizontal axis measures the distance between groups. The horizontal axis ranges from zero to just over 120. At zero, every mountain is its own cluster, so there are five clusters. At 120, the data falls into two clusters, and at about 20 there are three clusters.
Here is the dendrogram for all peaks.
You can use the dendrogram to determine how much distance there is between groups, or how many groups your data falls into. Picking a level to cluster on is called cutting the tree.
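The sketch below illustrates cutting the tree on the five-peak distance table shown earlier. It uses `hclust()` with complete linkage, which is the function's default; the original analysis does not say which linkage it used, so treat that choice as an assumption.

```r
# Pairwise distances in miles for the first five peaks, copied from the
# lower-triangle table shown earlier and mirrored into a symmetric matrix.
peak_names <- c("Mount Elbert", "Mount Massive", "Mount Harvard",
                "Blanca Peak", "La Plata Peak")
d <- matrix(0, 5, 5, dimnames = list(peak_names, peak_names))
d[lower.tri(d)] <- c(5.1, 15.0, 118.6, 6.3,  # distances from Mount Elbert
                     20.0, 123.6, 10.9,       # from Mount Massive
                     103.6, 10.9,             # from Mount Harvard
                     113.8)                   # from Blanca Peak
d <- d + t(d)

# as.dist() converts the matrix to the "dist" object hclust() expects.
# Complete linkage is an assumption (it is the hclust() default).
hc <- hclust(as.dist(d), method = "complete")

# Cut by height: merges that happen above h are undone.
groups_h <- cutree(hc, h = 15)
# Cut by count: ask directly for k groups.
groups_k <- cutree(hc, k = 2)

print(hc$height)
print(groups_h)
print(groups_k)
```

Calling `plot(hc)` draws the dendrogram for these five peaks. Cutting by `h` suits questions about how far apart groups are, while cutting by `k` suits cases where the number of groups is known in advance.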
Because the peaks are attributed to one of six mountain ranges, we will cut the tree at a height that gives six groups. The table below compares the hierarchical groups with the mountain ranges.
Mountain.Range | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
Elk Mountains | 5 | 0 | 0 | 0 | 0 | 0 |
Front Range | 0 | 0 | 0 | 5 | 1 | 0 |
Mosquito Range | 5 | 0 | 0 | 0 | 0 | 0 |
San Juan Mountains | 0 | 0 | 12 | 0 | 0 | 0 |
Sangre de Cristo Range | 0 | 9 | 0 | 0 | 0 | 1 |
Sawatch Range | 15 | 0 | 0 | 0 | 0 | 0 |
The groupings based on distance are similar to how the peaks are organized into mountain ranges. Group 1 includes peaks from three different mountain ranges. Group 3 contains all peaks from the San Juan Mountains, and Group 2 captures nine of the ten peaks in the Sangre de Cristo Range.
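One way to quantify that comparison is sketched below, using the counts from the comparison table. The two summaries (ranges per group, and the share of each range that lands in its largest group) are illustrative choices for this sketch, not measures used in the original analysis.

```r
# Counts copied from the comparison table: rows are mountain ranges,
# columns are the six hierarchical groups.
ranges <- c("Elk Mountains", "Front Range", "Mosquito Range",
            "San Juan Mountains", "Sangre de Cristo Range", "Sawatch Range")
tab <- matrix(
  c( 5, 0,  0, 0, 0, 0,
     0, 0,  0, 5, 1, 0,
     5, 0,  0, 0, 0, 0,
     0, 0, 12, 0, 0, 0,
     0, 9,  0, 0, 0, 1,
    15, 0,  0, 0, 0, 0),
  nrow = 6, byrow = TRUE,
  dimnames = list(range = ranges, group = as.character(1:6))
)

# Number of ranges that contribute at least one peak to each group.
ranges_per_group <- colSums(tab > 0)

# Share of each range's peaks in its largest group (1 = range kept together).
share_in_main_group <- apply(tab, 1, max) / rowSums(tab)

print(ranges_per_group)
print(round(share_in_main_group, 2))
```

With the full dataset, the same table would come from `table(range, cutree(tree, k = 6))`, where `range` and `tree` stand in for the range column and the fitted tree.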
The map below shows the peaks where the color represents each group and the labels show the mountain range. Mouse over to see the labels.
That is a simple example of clustering peaks by location. But maybe we want to find groups of similar peaks by including more factors. Perhaps we want to group by a hiker’s experience, for example.
The data also includes variables describing the hiking difficulty on the standard route: distance, elevation gain, and a difficulty rating.
The table below shows the first five rows.
Mountain.Peak | Distance_mi | Elevation.Gain_ft | Difficulty_Rating |
---|---|---|---|
Mount Elbert | 9.50 | 4700 | 1.0 |
Mount Massive | 14.50 | 4500 | 2.0 |
Mount Harvard | 14.00 | 4600 | 2.0 |
Blanca Peak | 17.00 | 6500 | 2.5 |
La Plata Peak | 9.25 | 4500 | 2.0 |
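The numeric `Difficulty_Rating` column appears to be derived from the text `Difficulty` labels in the first table. A minimal sketch of one possible conversion follows. The rule (class number plus 0.5 for a "Hard" prefix) is an assumption inferred from the five rows shown, and `difficulty_rating` is a hypothetical helper name.

```r
# Convert labels such as "Class 2" or "Hard Class 2" to numbers.
# Assumption: rating = class number, plus 0.5 when the label starts with "Hard".
# This reproduces the rows shown above but may not be the author's exact rule.
difficulty_rating <- function(label) {
  class_num <- as.numeric(sub(".*Class *([0-9]+).*", "\\1", label))
  class_num + ifelse(grepl("^Hard", label), 0.5, 0)
}

# Difficulty labels of the first five peaks, copied from the first table.
difficulty <- c("Class 1", "Class 2", "Class 2", "Hard Class 2", "Class 2")
print(difficulty_rating(difficulty))
```

A numeric rating is needed because distance-based clustering cannot use text labels directly.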
We will run another cluster analysis, but this time including the hiking difficulty variables along with the location (lat & long). Because the factors are measured using different scales, the data will be normalized before calculating distances.
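scaledFactors <- scale(peaks[, c(8, 9, 11, 12, 18)]) # standardize each column to mean 0, sd 1; variant with column 19: scale(peaks[, c(8, 9, 11, 12, 18, 19)])
dist_matrix <- dist(scaledFactors) # Euclidean distances as a "dist" object, the input hclust() expects
peak_clusters <- hclust(dist_matrix) # hierarchical clustering (complete linkage by default)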
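To see why scaling matters, here is a minimal sketch using only the three hiking factors for the first five peaks from the table above (the full analysis also includes location). In raw units, Mount Massive looks closer to La Plata Peak than to Mount Harvard, only because a 100 ft difference in elevation gain outweighs a difference of more than 5 miles in distance. After scaling, the order reverses.

```r
# Hiking factors for the first five peaks, copied from the table above.
hike <- data.frame(
  Distance_mi       = c(9.50, 14.50, 14.00, 17.00, 9.25),
  Elevation.Gain_ft = c(4700, 4500, 4600, 6500, 4500),
  Difficulty_Rating = c(1.0, 2.0, 2.0, 2.5, 2.0),
  row.names = c("Mount Elbert", "Mount Massive", "Mount Harvard",
                "Blanca Peak", "La Plata Peak")
)

# Unscaled: elevation gain (thousands of feet) dominates the distances.
raw_dist <- as.matrix(dist(hike))

# scale() centers each column on 0 and divides by its standard deviation,
# so every factor contributes on a comparable footing.
scaled <- scale(hike)
scaled_dist <- as.matrix(dist(scaled))

print(round(raw_dist["Mount Massive", c("Mount Harvard", "La Plata Peak")], 2))
print(round(scaled_dist["Mount Massive", c("Mount Harvard", "La Plata Peak")], 2))
```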
Based on what we know about the data and reviewing the dendrogram, we will cut the tree to produce 8 groups.
Group8 | Members |
---|---|
1 | 16 |
2 | 6 |
3 | 6 |
4 | 11 |
5 | 5 |
6 | 4 |
7 | 1 |
8 | 4 |
The map and boxplot below show how the clustering algorithm grouped similar mountains based on hiker experience factors.
For example, you can see that Group 4 includes the easiest mountains, which also happen to be more accessible to people living in Denver and the Front Range. Group 5 includes the most technically challenging peaks.
If you have a large dataset, especially one with many dimensions, clustering is a useful step toward pulling out relevant insights. Within the property management context, for example, it can show building managers how their performance compares with that of similar facilities. It can also show investors revenue and cost benchmarks for a particular area.
For fourteener hikers, the groupings can narrow the 53 peaks down to a handful that match their interests.
For example, say someone hiked Snowmass Mountain and loved the technical challenge.
That person’s interests might best match Group 5. The radar charts show Snowmass and the other two Group 5 peaks in the Elk Mountains.
The table and map below show all the mountains in Group 5.
Row | Mountain.Peak | Mountain.Range | Standard.Route | Distance_mi | Elevation.Gain_ft | Difficulty | hikeGroup |
---|---|---|---|---|---|---|---|
16 | Mount Wilson | San Juan Mountains | North Slopes | 16.00 | 5100 | Class 4 | 5 |
29 | Capitol Peak | Elk Mountains | Northeast Ridge | 17.00 | 5300 | Class 4 | 5 |
31 | Snowmass Mountain | Elk Mountains | East Slopes | 22.00 | 5800 | Hard Class 3 | 5 |
39 | Sunlight Peak | San Juan Mountains | South Face | 17.00 | 6000 | Class 4 | 5 |
47 | Pyramid Peak | Elk Mountains | Northeast Ridge | 8.25 | 4500 | Class 4 | 5 |
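Looking up peers for a given peak can be sketched as a simple filter over the group. The code below rebuilds the Group 5 table as a data frame and finds the other Group 5 peaks in the same range as Snowmass Mountain. `find_peers` is a hypothetical helper written for this illustration.

```r
# Group 5 members, copied from the table above.
group5 <- data.frame(
  Mountain.Peak     = c("Mount Wilson", "Capitol Peak", "Snowmass Mountain",
                        "Sunlight Peak", "Pyramid Peak"),
  Mountain.Range    = c("San Juan Mountains", "Elk Mountains", "Elk Mountains",
                        "San Juan Mountains", "Elk Mountains"),
  Distance_mi       = c(16.00, 17.00, 22.00, 17.00, 8.25),
  Elevation.Gain_ft = c(5100, 5300, 5800, 6000, 4500),
  Difficulty        = c("Class 4", "Class 4", "Hard Class 3", "Class 4", "Class 4"),
  stringsAsFactors  = FALSE
)

# Peers of a reference peak: every other row in the group,
# optionally restricted to the reference peak's mountain range.
find_peers <- function(df, peak, same_range = FALSE) {
  ref   <- df[df$Mountain.Peak == peak, ]
  peers <- df[df$Mountain.Peak != peak, ]
  if (same_range) {
    peers <- peers[peers$Mountain.Range == ref$Mountain.Range, ]
  }
  peers
}

elk_peers <- find_peers(group5, "Snowmass Mountain", same_range = TRUE)
print(elk_peers[, c("Mountain.Peak", "Distance_mi", "Elevation.Gain_ft", "Difficulty")])
```

The same pattern carries over to the property benchmarking case: pick an asset, look up its cluster, and compare it against the other members.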