During the winter season, skiing and snowboarding are popular recreational activities, attracting numerous enthusiasts to ski resorts worldwide. This project aims to investigate whether ski resorts exhibit significant differences or if their characteristics are largely uniform, irrespective of location or features.
The analysis is based on a dataset sourced from Kaggle [https://www.kaggle.com/datasets/farheenshaukat/ski-resort], which provides detailed information about ski resorts, including their geographical location, pricing, slope characteristics, lift infrastructure, and snow cannon availability.
Clustering analysis was conducted in two phases. Initially, resorts were categorized based on their geographic locations. Subsequently, clustering was performed using different attributes such as pricing, slope variety, lift availability, and other relevant features to identify potential similar groupings.
The following steps are taken to obtain the best possible clustering:
- Preprocess the dataset (correct the data types and structure, choose the region of interest).
- Compare how K-means, PAM, DBSCAN and hierarchical clustering perform on location and on ski resort features.
resort <- read_excel("resorts.xls")
str(resort)
## tibble [497 × 25] (S3: tbl_df/tbl/data.frame)
## $ ID : num [1:497] 1 2 3 4 5 6 7 8 9 10 ...
## $ Resort : chr [1:497] "Hemsedal" "Geilosiden Geilo" "Golm" "Red Mountain Resort-Rossland" ...
## $ Latitude : chr [1:497] "60.9282437" "60.5345261" "47.05781" "49.1055201" ...
## $ Longitude : chr [1:497] "8.38348693" "8.2063719" "9.8281668" "-117.8462801" ...
## $ Country : chr [1:497] "Norway" "Norway" "Austria" "Canada" ...
## $ Continent : chr [1:497] "Europe" "Europe" "Europe" "North America" ...
## $ Price : num [1:497] 46 44 48 60 45 43 61 57 22 20 ...
## $ Season : chr [1:497] "November - May" "November - April" "December - April" "December - April" ...
## $ Highest point : num [1:497] 1450 1178 2110 2075 1030 ...
## $ Lowest point : num [1:497] 620 800 650 1185 195 ...
## $ Beginner slopes : num [1:497] 29 18 13 20 33 25 5 10 4 7 ...
## $ Intermediate slopes: num [1:497] 10 12 12 50 7 4 0 15 0 1 ...
## $ Difficult slopes : num [1:497] 4 4 1 50 4 11 5 10 0 0 ...
## $ Total.slopes : num [1:497] 43 34 26 120 44 40 10 35 4 8 ...
## $ Longest.run : num [1:497] 6 2 9 7 6 0 0 13 0 6 ...
## $ Snow.cannons : num [1:497] 325 100 123 0 150 40 0 0 0 0 ...
## $ Surface lifts : num [1:497] 15 18 4 2 14 7 5 4 3 4 ...
## $ Chair lifts : num [1:497] 6 6 4 5 3 4 1 6 1 0 ...
## $ Gondola lifts : num [1:497] 0 0 3 1 1 0 0 1 0 0 ...
## $ Total.lifts : num [1:497] 21 24 11 8 18 11 6 11 4 4 ...
## $ Lift.capacity : num [1:497] 22921 14225 16240 9200 21060 ...
## $ Child.friendly : chr [1:497] "Yes" "Yes" "Yes" "Yes" ...
## $ Snowparks : chr [1:497] "Yes" "Yes" "No" "Yes" ...
## $ Nightskiing : chr [1:497] "Yes" "Yes" "No" "Yes" ...
## $ Summer skiing : chr [1:497] "No" "No" "No" "No" ...
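The `str()` output above lists `Latitude`, `Longitude`, `Child.friendly` and `Nightskiing` as character columns, while the summary that follows treats them as numeric, so a conversion step evidently sits between the two. That code is not shown; the sketch below is one plausible way to do it, on a toy table with two rows copied from the output above. The helper name `yes_no_to_binary` is hypothetical.

```r
# Toy stand-in for a few columns of the resort table (two rows taken from the str() output).
toy <- data.frame(
  Latitude       = c("60.9282437", "47.05781"),
  Longitude      = c("8.38348693", "9.8281668"),
  Child.friendly = c("Yes", "Yes"),
  Nightskiing    = c("Yes", "No"),
  stringsAsFactors = FALSE
)

# Coordinates stored as text must become numeric before any distance is computed.
toy$Latitude  <- as.numeric(toy$Latitude)
toy$Longitude <- as.numeric(toy$Longitude)

# Hypothetical helper: recode Yes/No flags as 1/0 so they can enter numeric summaries.
yes_no_to_binary <- function(x) as.numeric(x == "Yes")
toy$Child.friendly <- yes_no_to_binary(toy$Child.friendly)
toy$Nightskiing    <- yes_no_to_binary(toy$Nightskiing)

str(toy)
```

Whether the original analysis used `as.numeric()` or another conversion is an assumption; only the resulting column types are visible in the summary.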
summary(resort) #no missing values
## ID Resort Latitude Longitude
## Min. : 1.0 Length:497 Min. :-45.05 Min. :-149.741
## 1st Qu.:125.0 Class :character 1st Qu.: 43.67 1st Qu.: 1.313
## Median :249.0 Mode :character Median : 46.35 Median : 8.206
## Mean :249.5 Mean : 43.19 Mean : -6.062
## 3rd Qu.:375.0 3rd Qu.: 47.33 3rd Qu.: 12.429
## Max. :499.0 Max. : 67.78 Max. : 176.877
## Country Continent Price Season
## Length:497 Length:497 Min. : 0.00 Length:497
## Class :character Class :character 1st Qu.: 36.00 Class :character
## Mode :character Mode :character Median : 45.00 Mode :character
## Mean : 48.76
## 3rd Qu.: 54.00
## Max. :141.00
## Highest point Lowest point Beginner slopes Intermediate slopes
## Min. : 163 Min. : 36 Min. : 0.00 Min. : 0.00
## 1st Qu.:1588 1st Qu.: 800 1st Qu.: 10.00 1st Qu.: 12.00
## Median :2175 Median :1121 Median : 18.00 Median : 25.00
## Mean :2161 Mean :1201 Mean : 31.87 Mean : 38.01
## 3rd Qu.:2700 3rd Qu.:1500 3rd Qu.: 30.00 3rd Qu.: 45.00
## Max. :3914 Max. :3286 Max. :312.00 Max. :239.00
## Difficult slopes Total.slopes Longest.run Snow.cannons
## Min. : 0.00 Min. : 1.00 Min. : 0.000 Min. : 0.0
## 1st Qu.: 3.00 1st Qu.: 30.00 1st Qu.: 0.000 1st Qu.: 0.0
## Median : 9.00 Median : 55.00 Median : 3.000 Median : 15.0
## Mean : 16.21 Mean : 86.09 Mean : 3.545 Mean : 179.3
## 3rd Qu.: 21.00 3rd Qu.:100.00 3rd Qu.: 6.000 3rd Qu.: 180.0
## Max. :126.00 Max. :600.00 Max. :16.000 Max. :2383.0
## Surface lifts Chair lifts Gondola lifts Total.lifts
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 3.00 1st Qu.: 3.00 1st Qu.: 0.000 1st Qu.: 10.00
## Median : 7.00 Median : 6.00 Median : 1.000 Median : 15.00
## Mean :11.28 Mean : 9.74 Mean : 3.264 Mean : 24.28
## 3rd Qu.:14.00 3rd Qu.:12.00 3rd Qu.: 4.000 3rd Qu.: 26.00
## Max. :89.00 Max. :74.00 Max. :40.000 Max. :174.00
## Lift.capacity Child.friendly Snowparks Nightskiing
## Min. : 0 Min. :0.000 Length:497 Min. :0.0000
## 1st Qu.: 11620 1st Qu.:1.000 Class :character 1st Qu.:0.0000
## Median : 18510 Median :1.000 Mode :character Median :0.0000
## Mean : 31699 Mean :0.992 Mean :0.4085
## 3rd Qu.: 32938 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. :252280 Max. :1.000 Max. :1.0000
## Summer skiing
## Length:497
## Class :character
## Mode :character
##
##
##
Presenting the ski resorts map
Most of the observations are located in Europe and North America.
continent_counts <- table(resort$Continent)
print(continent_counts) #Ski resorts mostly located in Europe
##
## Asia Europe North America Oceania South America
## 24 358 98 10 7
# Euclidean distance does not take the curvature of the Earth into account. To limit the resulting error, I'll keep only the ski resorts located in Europe, which make up the majority of the sample.
Europe<-resort[resort$Continent=="Europe",1:25]
leaflet(data=Europe[,3:4]) %>%
addTiles() %>%
addMarkers(~Longitude,~Latitude)%>%
addControl(
html = "<h3>Map of ski resorts across Europe</h3>",
position = "bottomleft"
)
Most of them are concentrated in one region: the Alps.
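The earlier remark that Euclidean distance ignores the Earth's curvature can be made concrete by comparing a distance in raw degrees with a great-circle distance. The sketch below uses the coordinates of Hemsedal and Golm from the dataset; the function name `haversine_km` and the mean Earth radius of 6371 km are assumptions of this example, not part of the original analysis.

```r
# Great-circle (haversine) distance in kilometres between two points given in degrees.
# radius_km = 6371 is the usual mean-radius approximation of the Earth.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Two resorts from the dataset: Hemsedal (Norway) and Golm (Austria).
hemsedal <- c(lat = 60.9282437, lon = 8.38348693)
golm     <- c(lat = 47.05781,   lon = 9.8281668)

# Euclidean distance on raw degrees, which is what the clustering functions see.
euclid_deg <- sqrt(sum((hemsedal - golm)^2))

# Great-circle distance on the sphere.
great_circle <- haversine_km(hemsedal[["lat"]], hemsedal[["lon"]],
                             golm[["lat"]], golm[["lon"]])

# One degree of longitude covers fewer kilometres at higher latitudes:
# roughly 111.32 km * cos(latitude).
km_per_deg_lon <- 111.32 * cos(c(45, 60) * pi / 180)

print(euclid_deg)
print(great_circle)
print(km_per_deg_lon)
```

Because a degree of longitude is shorter at 60°N than at 45°N, treating latitude and longitude as plain x and y coordinates stretches east-west distances in Scandinavia relative to the Alps. Restricting the data to one continent limits this effect but does not remove it.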
hopkins(Europe[,3:4], nrow(Europe) - 1) # A value close to 1 rejects the null hypothesis that the data are uniformly distributed, so the dataset shows a clear clustering tendency.
## [1] 0.9729364
#Optimal number of clusters
opt1<-Optimal_Clusters_KMeans(Europe[,3:4], max_clusters=10, plot_clusters=TRUE, criterion="silhouette") #3 clusters have the highest silhouette value
opt2<-Optimal_Clusters_KMeans(Europe[,3:4], max_clusters=10, plot_clusters = TRUE) #The elbow point also suggests 3 or 4 clusters
#Based on that, I'll choose 3 clusters for further analysis.
# Silhouette information for 3 clusters
clara3 <- clara(Europe[,3:4], 3) # named clara3 so that the object does not mask the clara() function
plot(silhouette(clara3)) # Average silhouette width = 0.64, which means that points are fairly well assigned to their clusters
# However, the automatic selection based on average silhouette width (0.57 for 8 clusters) and the DBSCAN result both point to 8 clusters, so I'll consider that number as well.
opt_aut<-pamk(Europe[,3:4], krange=2:10, criterion="asw", usepam=TRUE, scaling=FALSE, alpha=0.001, diss=inherits(Europe[,3:4], "dist"), critout=FALSE) # fpc::pamk()
opt_aut #8 clusters suggested
## $pamobject
## Medoids:
## ID Latitude Longitude
## [1,] 170 61.35615 12.3670260
## [2,] 76 46.98628 10.2745177
## [3,] 172 47.23838 13.1793966
## [4,] 13 45.62648 6.8529624
## [5,] 16 56.85221 -4.9987681
## [6,] 100 42.69889 0.9347175
## [7,] 322 49.11919 20.0639550
## [8,] 329 43.68305 40.2664750
## Clustering vector:
## [1] 1 1 2 1 1 3 3 2 4 2 4 4 4 5 4 5 2 4 2 2 3 2 3 4 3 6 2 2 2 4 2 4 2 1 4 4 4
## [38] 4 4 4 6 6 4 4 4 4 4 4 4 4 4 4 4 4 4 4 6 1 1 3 6 7 3 5 2 2 3 2 2 3 3 2 2 3
## [75] 3 2 3 2 2 3 2 3 3 3 3 3 2 2 3 3 3 4 6 3 2 4 4 3 6 6 4 4 4 4 4 2 2 4 2 4 4
## [112] 4 4 4 2 3 4 4 4 4 4 6 2 4 6 4 3 6 4 3 4 4 4 4 2 4 2 3 4 4 3 4 6 4 4 6 2 3
## [149] 1 3 4 4 4 4 3 4 4 3 4 3 6 2 8 3 6 3 4 2 2 1 2 3 3 4 2 2 3 6 2 4 7 4 6 3 3
## [186] 2 2 4 3 4 2 2 6 4 4 6 4 2 2 4 7 6 1 4 4 1 2 4 3 2 4 6 7 4 4 4 4 2 4 3 4 2
## [223] 2 2 2 3 4 4 6 2 6 2 1 7 1 8 3 2 4 7 2 7 3 4 7 4 3 2 2 2 1 3 1 7 2 3 2 4 6
## [260] 4 3 4 2 6 4 8 7 6 3 2 2 2 4 3 7 3 2 5 2 7 3 4 6 1 2 2 3 2 2 2 7 3 2 2 3 3
## [297] 7 2 3 3 2 3 2 3 7 3 3 2 3 3 2 3 3 4 6 2 3 4 2 2 7 7 3 3 4 3 4 2 8 3 3 1 3
## [334] 2 2 3 4 3 2 2 3 2 7 1 1 4 7 3 2 4 4 2 3 4 6 2 2 3
## Objective function:
## build swap
## 1.476726 1.407349
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
##
## $nc
## [1] 8
##
## $crit
## [1] 0.0000000 0.3981429 0.4757789 0.3309225 0.4092174 0.4828333 0.5064075
## [8] 0.5333680 0.4885942 0.4999191
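The `$crit` vector above holds one average silhouette width per candidate number of clusters, and `$nc` is the position of its maximum. The selection rule can be sketched by hand as follows, on synthetic points rather than the resort coordinates. The sketch assumes only the `cluster` package and its `pam()` function.

```r
library(cluster)  # pam(); a recommended package shipped with standard R installations

set.seed(42)
# Synthetic 2-D points around three well-separated centres (illustrative only).
centres <- matrix(c(0, 0, 10, 0, 5, 9), ncol = 2, byrow = TRUE)
pts <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(40, centres[i, 1], 0.8), rnorm(40, centres[i, 2], 0.8))
}))

# Average silhouette width of the PAM solution for each candidate k.
k_range <- 2:6
asw <- sapply(k_range, function(k) pam(pts, k)$silinfo$avg.width)
names(asw) <- k_range

# The k with the largest average silhouette width is the suggested number of clusters.
best_k <- k_range[which.max(asw)]
print(round(asw, 3))
print(best_k)
```

On real data the maximum is often less pronounced than in this toy case, which is why the analysis here keeps two candidate values of k instead of one.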
Since the dataset contains meaningful clusters, I’ll choose 3 and 8 clusters for further analysis.
cluster_km3 <- kmeans(Europe[,3:4],3)
plot(Europe$Latitude,Europe$Longitude,col = cluster_km3$cluster, pch = 19, xlab = "Latitude", ylab = "Longitude", main = "Clustering with K-means method - 3 clusters")
europe_kmeans3 <- data.frame(Europe[,3:4], cluster = as.factor(cluster_km3$cluster))
pal <- colorFactor(palette = "Set1", domain = europe_kmeans3$cluster)
leaflet(data = europe_kmeans3) %>%
addTiles() %>%
addCircleMarkers(~Longitude, ~Latitude,color = ~pal(cluster),
radius = 4, fill = TRUE, fillOpacity = 0.5, stroke = FALSE, popup = ~paste("Cluster:", cluster)
) %>%
addLegend(position = "topleft", pal = pal, values = europe_kmeans3$cluster, title = "3 clusters - K-MEANS")
The clusters represent 3 different regions:
- 1: Scandinavia, Italy, Austria, Germany, Poland and Slovenia
- 2: Poland, Slovenia, Ukraine and the Balkans region
- 3: UK, Spain, France and Switzerland

Short description of the clusters in the 8-cluster solution:
- Cluster 1 covers ski resorts in the Caucasus (Russia).
- Clusters 2, 5 and 7 are close to each other. This is the Alps region, and it mostly contains ski resorts from France, Germany, Austria, Italy and Switzerland.
- Cluster 3 covers ski resorts in Finland.
- Cluster 4 contains ski resorts from France and Spain, mainly in the area of the Pyrenees.
- Cluster 6 covers the Scandinavian Mountains and ski resorts based in the UK.
- Cluster 8 is scattered on the map; its points are not located close to each other. It contains ski resorts near the Carpathian/Tatra Mountains and in the Balkans region (Dinaric Mountains).
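The 8-cluster K-means object (`cluster_km8`, used later for the Calinski-Harabasz index) is not shown being created. A minimal sketch, assuming it mirrors the 3-cluster call, is given below on synthetic coordinates. The fixed seed and `nstart = 25` are additions of this example, and the object name `km8_sketch` is hypothetical.

```r
set.seed(123)  # assumption: fixed seed so that cluster labels are reproducible between runs

# Stand-in for Europe[, 3:4]: synthetic coordinates, not the real resort data.
coords <- data.frame(
  Latitude  = runif(200, 36, 70),
  Longitude = runif(200, -10, 40)
)

# Same call shape as the 3-cluster model, with 8 centres.
# nstart = 25 (an added assumption) reruns K-means from 25 random starts and keeps the best fit.
km8_sketch <- kmeans(coords, centers = 8, nstart = 25)

table(km8_sketch$cluster)  # cluster sizes
km8_sketch$centers         # one centroid (Latitude, Longitude) per cluster
```

Without a fixed seed, K-means may number the clusters differently on each run, so cluster labels in a written description can stop matching the map after re-running the code.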
eur <- Europe[, 3:4]
dbscan::kNNdistplot(eur, k = 3) # looking for optimal eps
abline(h = 1.5, lty = 2) #eps=1.5 seems to be the optimal level
db <- fpc::dbscan(eur, eps = 1.5, MinPts = 3)
plot(db, eur, main = "DBSCAN", frame = FALSE) # Plot DBSCAN results
fviz_cluster(db, eur, stand = FALSE, ellipse = FALSE, geom = "point")
dbscan8 <- data.frame(eur, cluster = as.factor(db$cluster))
pal <- colorFactor(palette = "Set1", domain = dbscan8$cluster)
leaflet(data = dbscan8) %>%
addTiles() %>%
addCircleMarkers(~Longitude, ~Latitude, color = ~pal(cluster),
radius = 5, fill = TRUE, fillOpacity = 0.5, stroke = FALSE,
popup = ~paste("Cluster:", cluster)
) %>%
addLegend(position = "topleft", pal = pal, values = dbscan8$cluster, title = "Clusters - DBSCAN")
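The choice of `eps` above rests on the k-nearest-neighbour distance plot. What that plot shows can be reproduced in base R: for every point, take the distance to its k-th nearest neighbour and sort the values. The sketch below does this on synthetic points; it approximates the curve drawn by `dbscan::kNNdistplot()` and is not that function's implementation.

```r
set.seed(3)
# Synthetic points: two dense groups plus a few scattered points (illustrative only).
pts <- rbind(
  matrix(rnorm(80, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(80, mean = 5, sd = 0.5), ncol = 2),
  matrix(runif(10, min = -10, max = 15), ncol = 2)
)

k <- 3
dist_mat <- as.matrix(dist(pts))

# For each point, sort its distances; position 1 is the point itself (distance 0),
# so position k + 1 is the distance to its k-th nearest neighbour.
knn_dist <- apply(dist_mat, 1, function(row) sort(row)[k + 1])

# Sorted curve: a sharp bend ("knee") marks a candidate eps.
knn_sorted <- sort(knn_dist)
plot(knn_sorted, type = "l",
     xlab = "Points sorted by distance",
     ylab = paste0(k, "-NN distance"))
```

Points to the right of the knee have distant neighbours, and DBSCAN labels these as noise (cluster 0) once `eps` is set at the bend.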
DBSCAN identifies 8 clusters; cluster 0 collects the points it treats as outliers. Here, the outliers might result from using Euclidean distance, which is not ideal for larger regions as it does not account for the curvature of the Earth. Description of the clusters:
1. Alps region: France, Italy, Austria, Switzerland, Slovenia, Germany
2. Scandinavian Mountains: Norway only
3. Pyrenees Mountains: France, Spain, Andorra
4. Rhodope/Old Balkan Mountains: Bulgaria
5. Caucasus Mountains: Russia
6. Scandinavian Mountains: Sweden/Norway
7. Carpathian Mountains: Poland, Slovakia
8. Scandinavian Mountains: Norway

The observations classified as outliers come from the UK, Lithuania, Germany, Romania, and other areas that are not close to the main mountain ranges.
The presence of outliers has influenced the distribution of the clusters, so they do not fully align with the results of the K-means clustering method.
cluster_pam3<-eclust(eur, "pam", k= 3)
eur_pam3 <- data.frame(eur, cluster = as.factor(cluster_pam3$cluster))
pal <- colorFactor(palette = "Set1", domain = eur_pam3$cluster)
leaflet(data = eur_pam3) %>%
addTiles() %>%
addCircleMarkers( ~Longitude, ~Latitude, color = ~pal(cluster),
radius = 7, fill = TRUE, fillOpacity = 0.5, stroke = FALSE, popup = ~paste("Cluster:", cluster)
) %>%
addLegend(position = "topleft", pal = pal, values = eur_pam3$cluster, title = "PAM -3 clusters")
Cluster 1 covers ski resorts in Scandinavia and part of the UK. Cluster 2 includes the Carpathians, the Alps, the Caucasus, the Dinaric Alps, and other regions. Cluster 3 contains ski resorts near the Pyrenees and the Alps.
Here’s the division of ski resorts based on the PAM method with 8 clusters:
1. Scandinavian ski resorts: Norway, Sweden, and Finland
2. Alps: Germany, Switzerland, Liechtenstein, Austria, Italy
3. Alps: Czech Republic, Austria, Slovenia, Italy
4. Alps: mostly France and Switzerland
5. UK ski resorts
6. Ski resorts near the Pyrenees, also including resorts in France and Spain
7. Eastern Europe (Carpathians, Dinaric Alps): mainly ski resorts in Poland, Lithuania, Bulgaria, Slovakia
8. Ski resorts in the Caucasus: Russia only
The 8 clusters with the PAM method are better represented on the map than the 8 clusters provided by the DBSCAN method, as there are no outliers.
To sum up, I’ve performed clustering using three methods: K-means, DBSCAN, and PAM. The 3-cluster solutions looked similar across methods; the biggest differences appeared when clustering with 8 clusters. The Alps region contains so many resorts that, regardless of the method used, it was always split into 2 or 3 of the 8 clusters. When clustering resorts based on their location, the PAM method gave the most convincing result on the map.
round(calinhara(eur, cluster_km3$cluster),digits=2) #K-means with 3 clusters
## [1] 206.41
round(calinhara(eur, cluster_km8$cluster),digits=2) #K-means with 8 clusters
## [1] 468.93
round(calinhara(eur, db$cluster),digits=2) #DBscan with 8 clusters
## [1] 356.16
round(calinhara(eur, cluster_pam3$cluster),digits=2) #PAM with 3 clusters
## [1] 199.17
round(calinhara(eur, cluster_pam8$cluster),digits=2) #PAM with 8 clusters
## [1] 466.52
Calinski-Harabasz index: this statistic is usually used to compare solutions with different numbers of clusters, and the higher the value, the better. K-means with 8 clusters scores highest (468.93), with PAM with 8 clusters almost level (466.52). DBSCAN with 8 clusters scores lower (356.16) but still well above both 3-cluster solutions (206.41 for K-means and 199.17 for PAM). Based on this index, 8 clusters are a better solution than 3.
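To show what the index measures, the sketch below computes it by hand as the ratio of between-cluster to within-cluster sum of squares, each divided by its degrees of freedom. It runs on synthetic points; the function name `ch_index` is hypothetical, and the formula is the standard Calinski-Harabasz definition, which I understand `calinhara()` to follow.

```r
# Calinski-Harabasz index: (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k)).
ch_index <- function(x, labels) {
  x <- as.matrix(x)
  n <- nrow(x)
  groups <- split(seq_len(n), labels)
  k <- length(groups)
  overall <- colMeans(x)
  # Within-cluster SS: squared distances of points to their own cluster mean.
  wss <- sum(sapply(groups, function(idx) {
    centre <- colMeans(x[idx, , drop = FALSE])
    sum(sweep(x[idx, , drop = FALSE], 2, centre)^2)
  }))
  # Between-cluster SS: cluster size times squared distance of its mean to the overall mean.
  bss <- sum(sapply(groups, function(idx) {
    centre <- colMeans(x[idx, , drop = FALSE])
    length(idx) * sum((centre - overall)^2)
  }))
  (bss / (k - 1)) / (wss / (n - k))
}

set.seed(1)
# Synthetic points, not the resort coordinates.
pts <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2)
)
km <- kmeans(pts, centers = 2, nstart = 10)

ch_manual <- ch_index(pts, km$cluster)
# kmeans() stores both sums of squares, which gives an independent check.
ch_from_kmeans <- (km$betweenss / (2 - 1)) / (km$tot.withinss / (nrow(pts) - 2))
print(c(manual = ch_manual, from_kmeans = ch_from_kmeans))
```

The index rewards compact, well-separated groups, so it tends to favour centroid-based solutions such as K-means. For DBSCAN the noise points in cluster 0 are scored as if they formed a cluster of their own, which pulls its value down.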
hc2 <- agnes(eur, method = "complete") # agglomerative hierarchical clustering with agnes()
hc2$ac
## [1] 0.9916091
# The agglomerative coefficient measures the amount of clustering structure found; values closer to 1 suggest a strong clustering structure
# Compare several linkage methods
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
ac <- function(x) {
agnes(eur, method = x)$ac
}
map_dbl(m, ac) #Ward method has the highest score
## average single complete ward
## 0.9882083 0.9786959 0.9916091 0.9963802
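# cutting the tree, Ward's method
d <- dist(eur, method = "euclidean") # distance matrix; its definition is missing from the text, Euclidean distance is assumed
hc3 <- hclust(d, method = "ward.D2")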
# cut tree into 5 groups
sub_grp <- cutree(hc3, k = 5)
table(sub_grp)# number of members in each cluster
## sub_grp
## 1 2 3 4 5
## 19 165 118 33 23
Using more than 5 clusters creates additional clusters with only 4 or 5 observations. With 3 or 4 clusters, the clusters contain similar numbers of observations. I think 5 clusters work best for this dataset, and I’ll use Ward’s method.
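The reasoning about cluster sizes can be checked systematically by cutting the same tree at several values of k and tabulating the group sizes. The sketch below does this on synthetic coordinates standing in for `eur`; the names `sizes_by_k` and `smallest` are hypothetical.

```r
set.seed(7)
# Synthetic coordinates standing in for `eur` (illustrative only).
pts <- data.frame(
  Latitude  = c(rnorm(60, 46, 1), rnorm(25, 61, 2), rnorm(15, 43, 0.5)),
  Longitude = c(rnorm(60, 10, 3), rnorm(25, 12, 3), rnorm(15, 1, 1))
)

hc <- hclust(dist(pts), method = "ward.D2")

# Cluster sizes for each candidate number of clusters.
sizes_by_k <- lapply(2:7, function(k) as.vector(table(cutree(hc, k = k))))
names(sizes_by_k) <- paste0("k=", 2:7)

# Smallest cluster per k: a quick flag for cuts that create tiny groups.
smallest <- sapply(sizes_by_k, min)
print(sizes_by_k)
print(smallest)
```

Because each additional cut splits one existing cluster, the smallest cluster can only shrink or stay the same as k grows. That makes the table a simple way to see where further splitting stops being useful.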
Plots with borders
plot(hc3, cex = 0.6)
rect.hclust(hc3, k = 5, border = 2:5)
fviz_cluster(list(data = eur, cluster = sub_grp))
Hierarchical clustering presented on map
eur_5 <- data.frame(eur, cluster = as.factor(sub_grp))
pal <- colorFactor(palette = "Set1", domain = eur_5$cluster)
leaflet(data = eur_5) %>%
addTiles() %>%
addCircleMarkers( ~Longitude, ~Latitude, color = ~pal(cluster),
radius = 7, fill = TRUE, fillOpacity = 0.5, stroke = FALSE, popup = ~paste("Cluster:", cluster)
) %>%
addLegend(position = "topright", pal = pal, values = eur_5$cluster, title = "Hierarchical clustering -5 clusters")
Hierarchical clustering with 5 clusters shows ski resorts located in Scandinavia, Eastern Europe, Western Europe, and two clusters in the Alps region.
Clustering results differ based on the number of clusters. I would like to explore these differences further and understand their origins. Specifically, I aim to investigate how the quality and facilities of ski resorts might impact the clustering outcomes.
I’ll take into consideration: ski pass price, total number of slopes and their length, total number of lifts and their capacity. Number of snow cannons, child friendliness and possibility of skiing at night will also be included.
eur_sample<-Europe[c("Price","Total.slopes","Lift.capacity","Longest.run","Snow.cannons","Total.lifts","Child.friendly","Nightskiing")]
eur_sample <- subset(eur_sample, Price!= 0)
eur_sample <- subset(eur_sample, Lift.capacity!= 0)
str(eur_sample)
## tibble [354 × 8] (S3: tbl_df/tbl/data.frame)
## $ Price : num [1:354] 46 44 48 45 43 22 20 35 81 54 ...
## $ Total.slopes : num [1:354] 43 34 26 44 40 4 8 34 322 175 ...
## $ Lift.capacity : num [1:354] 22921 14225 16240 21060 11900 ...
## $ Longest.run : num [1:354] 6 2 9 6 0 0 6 3 16 10 ...
## $ Snow.cannons : num [1:354] 325 100 123 150 40 0 0 0 1060 630 ...
## $ Total.lifts : num [1:354] 21 24 11 18 11 4 4 21 63 84 ...
## $ Child.friendly: num [1:354] 1 1 1 1 1 1 1 1 1 1 ...
## $ Nightskiing : num [1:354] 1 1 0 1 0 0 1 1 0 0 ...
summary(eur_sample)
## Price Total.slopes Lift.capacity Longest.run
## Min. :17.0 Min. : 1.00 Min. : 900 Min. : 0.00
## 1st Qu.:34.0 1st Qu.: 26.00 1st Qu.: 11907 1st Qu.: 0.00
## Median :43.0 Median : 50.00 Median : 19760 Median : 3.00
## Mean :41.9 Mean : 90.14 Mean : 36449 Mean : 3.76
## 3rd Qu.:49.0 3rd Qu.:100.00 3rd Qu.: 38400 3rd Qu.: 7.00
## Max. :81.0 Max. :600.00 Max. :252280 Max. :16.00
## Snow.cannons Total.lifts Child.friendly Nightskiing
## Min. : 0.0 Min. : 1.00 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0.0 1st Qu.: 11.00 1st Qu.:1.0000 1st Qu.:0.0000
## Median : 65.5 Median : 17.00 Median :1.0000 Median :0.0000
## Mean : 230.2 Mean : 28.21 Mean :0.9972 Mean :0.4181
## 3rd Qu.: 265.2 3rd Qu.: 30.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :2383.0 Max. :174.00 Max. :1.0000 Max. :1.0000
hopkins(eur_sample, nrow(eur_sample) -1)
## [1] 0.9999989
The null hypothesis that the dataset is uniformly distributed is rejected. The Hopkins statistic is very close to 1, indicating that the data are highly clusterable.
opt1<-Optimal_Clusters_KMeans(eur_sample, max_clusters=10, plot_clusters = TRUE) #elbow method suggests 2 clusters
opt2<-Optimal_Clusters_KMeans(eur_sample, max_clusters=10, plot_clusters=TRUE, criterion="silhouette") #according to silhouette optimal number of clusters is 2, however 3 clusters would also work out
fviz_nbclust(eur_sample, kmeans, method = "wss") #2 clusters suggested
Optimal number of clusters chosen = 2
Average silhouette width = 0.71: this value indicates that the clustering structure is strong and well-defined.
cluster_km <- kmeans(eur_sample, 2)
cluster_km$centers
## Price Total.slopes Lift.capacity Longest.run Snow.cannons Total.lifts
## 1 40.09061 53.99676 21791.44 3.553398 124.343 18.35599
## 2 54.35556 338.35556 137101.38 5.177778 957.200 95.88889
## Child.friendly Nightskiing
## 1 0.9967638 0.3786408
## 2 1.0000000 0.6888889
The first cluster contains cheaper ski resorts with fewer slopes and smaller facilities. Both clusters are almost identical in terms of child-friendliness.
The second cluster consists of ski resorts with higher ski pass prices, a greater number of total slopes and lifts, larger lift capacities, and longer runs. It also has more snow cannons and more options for night skiing.
cluster_pam<-eclust(eur_sample, "pam", k= 2)
cluster_pam$medoids
## Price Total.slopes Lift.capacity Longest.run Snow.cannons Total.lifts
## [1,] 50 60 16381 8 75 15
## [2,] 81 322 93464 0 1060 63
## Child.friendly Nightskiing
## [1,] 1 1
## [2,] 1 0
The results are similar to the output of the K-means method. One difference: in PAM, the medoid of the more expensive cluster offers no night skiing and has a longest run of 0, unlike the corresponding K-means centroid. A medoid is a single actual resort, so it need not reflect the cluster average.
cluster_clara<-eclust(eur_sample, "clara", k=2)
cluster_clara
## Call: fun_clust(x = x, k = k)
## Medoids:
## Price Total.slopes Lift.capacity Longest.run Snow.cannons Total.lifts
## [1,] 30 65 17948 0 100 22
## [2,] 55 179 92510 8 1102 54
## Child.friendly Nightskiing
## [1,] 1 0
## [2,] 1 1
## Objective function: 15191.25
## Clustering vector: int [1:354] 1 1 1 1 1 1 1 1 2 2 1 2 1 1 1 1 1 1 ...
## Cluster sizes: 293 61
## Best sample:
## [1] 2 9 33 54 57 62 67 68 74 80 86 93 109 111 118 119 121 135 138
## [20] 142 149 153 155 157 164 179 180 187 194 226 260 262 265 270 275 278 284 306
## [39] 311 313 336 338 343 352
##
## Available components:
## [1] "sample" "medoids" "i.med" "clustering" "objective"
## [6] "clusinfo" "diss" "call" "silinfo" "data"
## [11] "clust_plot" "nbclust"
The CLARA results show differences between the clusters that are quite similar to those found by K-means.
## dbscan Pts=354 MinPts=9 eps=1500
## 0 1 2
## border 69 6 8
## seed 0 269 2
## total 69 275 10
According to the DBSCAN method, cheaper ski resorts have fewer and shorter slopes, fewer snow cannons, and fewer ski lifts with lower capacity. The cheaper cluster also contains a smaller share of child-friendly resorts and offers less night skiing than the more expensive one. Again, we can conclude that the price of the ski pass goes hand in hand with the quality of the resort and its facilities.
The differences between the clusters stem primarily from the ski pass price, the number of slopes, and their length. The number of lifts and their capacity are also important factors, as well as the availability of snow cannons.
In general, the “cheaper” cluster across all methods tends to have fewer slopes, a lower number of snow cannons, fewer lifts, and lower lift capacities. We can assume that a higher price is associated with better slopes and facilities. However, child-friendly resorts and night skiing options do not contribute significantly to the clustering results.
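One way to see why the binary variables carry so little weight is to compare how much each variable can contribute to a squared Euclidean distance on the raw scale. The sketch below uses the minimum and maximum values of three variables from the summary of `eur_sample`. The names `ranges` and `share` are hypothetical, and the calculation is an illustration, not part of the original analysis.

```r
# Minimum and maximum of three variables, taken from the summary of eur_sample.
ranges <- data.frame(
  Price         = c(17, 81),
  Lift.capacity = c(900, 252280),
  Nightskiing   = c(0, 1),
  row.names     = c("min", "max")
)

# Largest possible squared difference per variable on the raw (unscaled) values.
max_sq_diff <- sapply(ranges, function(v) diff(v)^2)

# Share of the largest possible squared Euclidean distance attributable to each variable.
share <- max_sq_diff / sum(max_sq_diff)
print(signif(share, 3))
```

On raw values, `Lift.capacity` can account for almost all of the distance between two resorts, while a 0/1 flag contributes at most 1. If equal weighting of the variables were wanted, the columns could be standardised, for example with `scale()`, before clustering; the analysis here works on the unscaled values.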
To sum up, ski resort data was clustered using different methods. I performed clustering based on location as well as on variables describing the ski resorts. As part of the preprocessing, I selected only European ski resorts because the clustering methods presented do not work perfectly with the curvature of the Earth, and working with a smaller region reduces the possibility of incorrect calculations, errors or outliers.
Clustering based on location produced different optimal numbers of clusters depending on the clustering method used. For the features and characteristics of the ski resorts, however, all methods gave consistent results.
Clustering based on location primarily placed clusters around the same mountain regions, with the Alps split into 2 or 3 clusters due to the larger number of ski resorts in that area. Meanwhile, clustering based on features revealed two clusters that differed in quality and service price. Judging by the higher Hopkins statistic and average silhouette width, this type of data is easier to cluster than location-based data.
In conclusion, ski resorts are more differentiated based on their location than based on their features.