Clustering among Ski Resorts

During the winter season, skiing and snowboarding are popular recreational activities, attracting numerous enthusiasts to ski resorts worldwide. This project aims to investigate whether ski resorts exhibit significant differences or if their characteristics are largely uniform, irrespective of location or features.

The analysis is based on a dataset sourced from Kaggle [https://www.kaggle.com/datasets/farheenshaukat/ski-resort], which provides detailed information about ski resorts, including their geographical location, pricing, slope characteristics, lift infrastructure, and snow cannon availability.

Clustering analysis was conducted in two phases. Initially, resorts were categorized based on their geographic locations. Subsequently, clustering was performed using different attributes such as pricing, slope variety, lift availability, and other relevant features to identify potential similar groupings.

Steps taken to obtain the best possible clustering:
- Preprocessing the dataset (correcting data types and structure, choosing the region of interest)
- Comparing how K-means, PAM, DBSCAN and hierarchical clustering perform on location and on ski resort features

Preprocessing

library(readxl) # provides read_excel()
resort <- read_excel("resorts.xls")
str(resort)

## tibble [497 × 25] (S3: tbl_df/tbl/data.frame)
##  $ ID                 : num [1:497] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Resort             : chr [1:497] "Hemsedal" "Geilosiden Geilo" "Golm" "Red Mountain Resort-Rossland" ...
##  $ Latitude           : chr [1:497] "60.9282437" "60.5345261" "47.05781" "49.1055201" ...
##  $ Longitude          : chr [1:497] "8.38348693" "8.2063719" "9.8281668" "-117.8462801" ...
##  $ Country            : chr [1:497] "Norway" "Norway" "Austria" "Canada" ...
##  $ Continent          : chr [1:497] "Europe" "Europe" "Europe" "North America" ...
##  $ Price              : num [1:497] 46 44 48 60 45 43 61 57 22 20 ...
##  $ Season             : chr [1:497] "November - May" "November - April" "December - April" "December - April" ...
##  $ Highest point      : num [1:497] 1450 1178 2110 2075 1030 ...
##  $ Lowest point       : num [1:497] 620 800 650 1185 195 ...
##  $ Beginner slopes    : num [1:497] 29 18 13 20 33 25 5 10 4 7 ...
##  $ Intermediate slopes: num [1:497] 10 12 12 50 7 4 0 15 0 1 ...
##  $ Difficult slopes   : num [1:497] 4 4 1 50 4 11 5 10 0 0 ...
##  $ Total.slopes       : num [1:497] 43 34 26 120 44 40 10 35 4 8 ...
##  $ Longest.run        : num [1:497] 6 2 9 7 6 0 0 13 0 6 ...
##  $ Snow.cannons       : num [1:497] 325 100 123 0 150 40 0 0 0 0 ...
##  $ Surface lifts      : num [1:497] 15 18 4 2 14 7 5 4 3 4 ...
##  $ Chair lifts        : num [1:497] 6 6 4 5 3 4 1 6 1 0 ...
##  $ Gondola lifts      : num [1:497] 0 0 3 1 1 0 0 1 0 0 ...
##  $ Total.lifts        : num [1:497] 21 24 11 8 18 11 6 11 4 4 ...
##  $ Lift.capacity      : num [1:497] 22921 14225 16240 9200 21060 ...
##  $ Child.friendly     : chr [1:497] "Yes" "Yes" "Yes" "Yes" ...
##  $ Snowparks          : chr [1:497] "Yes" "Yes" "No" "Yes" ...
##  $ Nightskiing        : chr [1:497] "Yes" "Yes" "No" "Yes" ...
##  $ Summer skiing      : chr [1:497] "No" "No" "No" "No" ...
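
The summary that follows reports Latitude, Longitude, Child.friendly and Nightskiing as numeric, although str() above lists them as character, so a conversion step presumably ran in between and is not shown in this rendering. A minimal sketch of such a step on a small made-up data frame (`toy`, its values, the helper `yes_no_to_num` and the 1/0 coding of Yes/No are all assumptions, not the original code):

```r
# Hypothetical stand-in for a few rows of the resort data
toy <- data.frame(
  Latitude       = c("60.9282437", "47.05781"),
  Longitude      = c("8.38348693", "9.8281668"),
  Child.friendly = c("Yes", "Yes"),
  Nightskiing    = c("Yes", "No"),
  stringsAsFactors = FALSE
)

# Coordinates stored as text become numbers
toy$Latitude  <- as.numeric(toy$Latitude)
toy$Longitude <- as.numeric(toy$Longitude)

# Yes/No flags become 1/0 (assumed coding)
yes_no_to_num <- function(x) as.numeric(x == "Yes")
toy$Child.friendly <- yes_no_to_num(toy$Child.friendly)
toy$Nightskiing    <- yes_no_to_num(toy$Nightskiing)

str(toy)
```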

summary(resort) #no missing values

##        ID           Resort             Latitude        Longitude       
##  Min.   :  1.0   Length:497         Min.   :-45.05   Min.   :-149.741  
##  1st Qu.:125.0   Class :character   1st Qu.: 43.67   1st Qu.:   1.313  
##  Median :249.0   Mode  :character   Median : 46.35   Median :   8.206  
##  Mean   :249.5                      Mean   : 43.19   Mean   :  -6.062  
##  3rd Qu.:375.0                      3rd Qu.: 47.33   3rd Qu.:  12.429  
##  Max.   :499.0                      Max.   : 67.78   Max.   : 176.877  
##    Country           Continent             Price           Season         
##  Length:497         Length:497         Min.   :  0.00   Length:497        
##  Class :character   Class :character   1st Qu.: 36.00   Class :character  
##  Mode  :character   Mode  :character   Median : 45.00   Mode  :character  
##                                        Mean   : 48.76                     
##                                        3rd Qu.: 54.00                     
##                                        Max.   :141.00                     
##  Highest point   Lowest point  Beginner slopes  Intermediate slopes
##  Min.   : 163   Min.   :  36   Min.   :  0.00   Min.   :  0.00     
##  1st Qu.:1588   1st Qu.: 800   1st Qu.: 10.00   1st Qu.: 12.00     
##  Median :2175   Median :1121   Median : 18.00   Median : 25.00     
##  Mean   :2161   Mean   :1201   Mean   : 31.87   Mean   : 38.01     
##  3rd Qu.:2700   3rd Qu.:1500   3rd Qu.: 30.00   3rd Qu.: 45.00     
##  Max.   :3914   Max.   :3286   Max.   :312.00   Max.   :239.00     
##  Difficult slopes  Total.slopes     Longest.run      Snow.cannons   
##  Min.   :  0.00   Min.   :  1.00   Min.   : 0.000   Min.   :   0.0  
##  1st Qu.:  3.00   1st Qu.: 30.00   1st Qu.: 0.000   1st Qu.:   0.0  
##  Median :  9.00   Median : 55.00   Median : 3.000   Median :  15.0  
##  Mean   : 16.21   Mean   : 86.09   Mean   : 3.545   Mean   : 179.3  
##  3rd Qu.: 21.00   3rd Qu.:100.00   3rd Qu.: 6.000   3rd Qu.: 180.0  
##  Max.   :126.00   Max.   :600.00   Max.   :16.000   Max.   :2383.0  
##  Surface lifts    Chair lifts    Gondola lifts     Total.lifts    
##  Min.   : 0.00   Min.   : 0.00   Min.   : 0.000   Min.   :  0.00  
##  1st Qu.: 3.00   1st Qu.: 3.00   1st Qu.: 0.000   1st Qu.: 10.00  
##  Median : 7.00   Median : 6.00   Median : 1.000   Median : 15.00  
##  Mean   :11.28   Mean   : 9.74   Mean   : 3.264   Mean   : 24.28  
##  3rd Qu.:14.00   3rd Qu.:12.00   3rd Qu.: 4.000   3rd Qu.: 26.00  
##  Max.   :89.00   Max.   :74.00   Max.   :40.000   Max.   :174.00  
##  Lift.capacity    Child.friendly   Snowparks          Nightskiing    
##  Min.   :     0   Min.   :0.000   Length:497         Min.   :0.0000  
##  1st Qu.: 11620   1st Qu.:1.000   Class :character   1st Qu.:0.0000  
##  Median : 18510   Median :1.000   Mode  :character   Median :0.0000  
##  Mean   : 31699   Mean   :0.992                      Mean   :0.4085  
##  3rd Qu.: 32938   3rd Qu.:1.000                      3rd Qu.:1.0000  
##  Max.   :252280   Max.   :1.000                      Max.   :1.0000  
##  Summer skiing     
##  Length:497        
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Presenting the Ski resorts map

Most of the observations are located in Europe and North America.

continent_counts <- table(resort$Continent)
print(continent_counts) #Ski resorts mostly located in Europe

## 
##          Asia        Europe North America       Oceania South America 
##            24           358            98            10             7

# Euclidean distance does not account for the curvature of the Earth. To limit the resulting error, I'll keep only ski resorts in Europe, which make up the majority of the sample.

Europe<-resort[resort$Continent=="Europe",1:25]
leaflet(data=Europe[,3:4]) %>%
  addTiles() %>%
  addMarkers(~Longitude,~Latitude)%>%
  addControl(
    html = "<h3>Map of ski resorts across Europe</h3>",
    position = "bottomleft"
  )

Most of them are concentrated in one region: the Alps.
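
The concern about Euclidean distance on latitude/longitude can be made concrete with a small sketch. The function `haversine_km` and the sample coordinates are illustrative assumptions, not part of the original analysis; the Earth radius of 6371 km is the usual mean value.

```r
# Great-circle (haversine) distance in km between two points given in degrees
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# One degree of longitude at the latitude of the Alps vs. northern Scandinavia
alps_km   <- haversine_km(46, 8, 46, 9)
nordic_km <- haversine_km(67, 8, 67, 9)

# Euclidean distance on raw degrees treats both pairs as equally far apart (1 unit)
c(alps_km = alps_km, nordic_km = nordic_km)
```

One degree of longitude covers noticeably fewer kilometres in the north than in the Alps, so distances between Scandinavian resorts are overstated relative to Alpine ones when raw degrees are clustered.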

Part 1 - clustering based on location only

How many clusters will be the best?

hopkins(Europe[,3:4],nrow(Europe) - 1) # A value close to 1 rejects the null hypothesis that the data are uniformly distributed, i.e. the dataset contains meaningful clusters.

## [1] 0.9729364

#Optimal number of clusters
opt1<-Optimal_Clusters_KMeans(Europe[,3:4], max_clusters=10, plot_clusters=TRUE, criterion="silhouette") #3 clusters have the highest silhouette value

opt2<-Optimal_Clusters_KMeans(Europe[,3:4], max_clusters=10, plot_clusters = TRUE) #Elbow point suggests 3/4 clusters also

# Based on that I'll choose 3 clusters for further analysis.
# Silhouette information for 3 clusters
clara3 <- clara(Europe[,3:4], 3)
plot(silhouette(clara3)) # Average silhouette width = 0.64 means that points are fairly well assigned to clusters

# However, automatic selection by average silhouette width suggests 8 clusters (about 0.53, see $crit in the output below), so I'll consider this number of clusters as well (DBSCAN also finds 8).
opt_aut<-pamk(Europe[,3:4], krange=2:10, criterion="asw", usepam=TRUE, scaling=FALSE, alpha=0.001, diss=inherits(Europe[,3:4], "dist"), critout=FALSE) # fpc::pamk()
opt_aut   #8 clusters suggested

## $pamobject
## Medoids:
##       ID Latitude  Longitude
## [1,] 170 61.35615 12.3670260
## [2,]  76 46.98628 10.2745177
## [3,] 172 47.23838 13.1793966
## [4,]  13 45.62648  6.8529624
## [5,]  16 56.85221 -4.9987681
## [6,] 100 42.69889  0.9347175
## [7,] 322 49.11919 20.0639550
## [8,] 329 43.68305 40.2664750
## Clustering vector:
##   [1] 1 1 2 1 1 3 3 2 4 2 4 4 4 5 4 5 2 4 2 2 3 2 3 4 3 6 2 2 2 4 2 4 2 1 4 4 4
##  [38] 4 4 4 6 6 4 4 4 4 4 4 4 4 4 4 4 4 4 4 6 1 1 3 6 7 3 5 2 2 3 2 2 3 3 2 2 3
##  [75] 3 2 3 2 2 3 2 3 3 3 3 3 2 2 3 3 3 4 6 3 2 4 4 3 6 6 4 4 4 4 4 2 2 4 2 4 4
## [112] 4 4 4 2 3 4 4 4 4 4 6 2 4 6 4 3 6 4 3 4 4 4 4 2 4 2 3 4 4 3 4 6 4 4 6 2 3
## [149] 1 3 4 4 4 4 3 4 4 3 4 3 6 2 8 3 6 3 4 2 2 1 2 3 3 4 2 2 3 6 2 4 7 4 6 3 3
## [186] 2 2 4 3 4 2 2 6 4 4 6 4 2 2 4 7 6 1 4 4 1 2 4 3 2 4 6 7 4 4 4 4 2 4 3 4 2
## [223] 2 2 2 3 4 4 6 2 6 2 1 7 1 8 3 2 4 7 2 7 3 4 7 4 3 2 2 2 1 3 1 7 2 3 2 4 6
## [260] 4 3 4 2 6 4 8 7 6 3 2 2 2 4 3 7 3 2 5 2 7 3 4 6 1 2 2 3 2 2 2 7 3 2 2 3 3
## [297] 7 2 3 3 2 3 2 3 7 3 3 2 3 3 2 3 3 4 6 2 3 4 2 2 7 7 3 3 4 3 4 2 8 3 3 1 3
## [334] 2 2 3 4 3 2 2 3 2 7 1 1 4 7 3 2 4 4 2 3 4 6 2 2 3
## Objective function:
##    build     swap 
## 1.476726 1.407349 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"      
## 
## $nc
## [1] 8
## 
## $crit
##  [1] 0.0000000 0.3981429 0.4757789 0.3309225 0.4092174 0.4828333 0.5064075
##  [8] 0.5333680 0.4885942 0.4999191

Since the dataset contains meaningful clusters, I’ll choose 3 and 8 clusters for further analysis.
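
The elbow criterion used above can be reproduced with base R alone. This is only a sketch on synthetic points (`pts` is a made-up stand-in for the coordinates, and the seed is arbitrary); it uses `stats::kmeans()` rather than the `Optimal_Clusters_KMeans()` helper from the analysis.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
# Synthetic stand-in for coordinates: three well-separated blobs of 50 points
pts <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..8
wss <- sapply(1:8, function(k) kmeans(pts, centers = k, nstart = 10)$tot.withinss)

# The "elbow" is where adding another cluster stops reducing WSS substantially
plot(1:8, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```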

Clustering based on location will be performed using: K-MEANS, DBSCAN, PAM and Hierarchical clustering methods.

K-MEANS

3 clusters

cluster_km3 <- kmeans(Europe[,3:4],3)
plot(Europe$Latitude,Europe$Longitude,col = cluster_km3$cluster, pch = 19, xlab = "Latitude", ylab = "Longitude", main = "Clustering with K-means method - 3 clusters")

europe_kmeans3 <- data.frame(Europe[,3:4], cluster = as.factor(cluster_km3$cluster))
pal <- colorFactor(palette = "Set1", domain = europe_kmeans3$cluster)
leaflet(data = europe_kmeans3) %>%
  addTiles() %>%
  addCircleMarkers(~Longitude, ~Latitude,color = ~pal(cluster),
    radius = 4, fill = TRUE, fillOpacity = 0.5, stroke = FALSE, popup = ~paste("Cluster:", cluster)
  ) %>%
  addLegend(position = "topleft", pal = pal, values = europe_kmeans3$cluster, title = "3 clusters - K-MEANS")

The clusters represent 3 different regions:
- 1 - Scandinavia, Italy, Austria, Germany, Poland and Slovenia
- 2 - Poland, Slovenia, Ukraine and the Balkans
- 3 - UK, Spain, France and Switzerland

8 clusters

Short description of the clusters:
- Cluster 1 covers ski resorts in the Caucasus (Russia)
- Clusters 2, 5 and 7 are close to each other. This is the Alps region, containing mostly ski resorts from France, Germany, Austria, Italy and Switzerland
- Cluster 3 covers ski resorts in Finland
- Cluster 4 contains ski resorts from France and Spain, mainly the area of the Pyrenees
- Cluster 6 covers the Scandinavian Mountains and ski resorts based in the UK
- Cluster 8 is scattered on the map; its points are not located close to each other. It contains ski resorts near the Carpathian/Tatra Mountains and in the Balkans (Dinaric Mountains)

DBSCAN

eur <- Europe[, 3:4]
dbscan::kNNdistplot(eur, k =  3) # looking for optimal eps
abline(h = 1.5, lty = 2) #eps=1.5 seems to be the optimal level

db <- fpc::dbscan(eur, eps = 1.5, MinPts = 3)
plot(db, eur, main = "DBSCAN", frame = FALSE) # Plot DBSCAN results

fviz_cluster(db, eur, stand = FALSE, ellipse = FALSE, geom = "point")

dbscan8 <- data.frame(eur, cluster = as.factor(db$cluster))
pal <- colorFactor(palette = "Set1", domain = dbscan8$cluster) 
leaflet(data = dbscan8) %>%
  addTiles() %>%
  addCircleMarkers(~Longitude, ~Latitude,  color = ~pal(cluster),
    radius = 5, fill = TRUE, fillOpacity = 0.5, stroke = FALSE,   
    popup = ~paste("Cluster:", cluster)
  ) %>%
  addLegend(position = "topleft", pal = pal, values = dbscan8$cluster, title = "Clusters - DBSCAN")

DBSCAN finds 8 clusters; cluster 0 collects the points it treats as outliers (noise). Here, the outliers might result from using Euclidean distance, which is not ideal for larger regions as it does not account for the curvature of the Earth. Description of clusters:
1. Alps region – France, Italy, Austria, Switzerland, Slovenia, Germany
2. Scandinavian Mountains – Norway only
3. Pyrenees Mountains – France, Spain, Andorra
4. Rhodope/Old Balkan Mountains – Bulgaria
5. Caucasus Mountains – Russia
6. Scandinavian Mountains – Sweden/Norway
7. Carpathian Mountains – Poland, Slovakia
8. Scandinavian Mountains – Norway

Observations classified as outliers come from the UK, Lithuania, Germany, Romania, and other areas that are not close to the main mountain ranges.

The presence of outliers has influenced the distribution of the clusters, so they do not fully align with the results of the K-Means clustering method.
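
The eps value above was read off a k-nearest-neighbour distance plot. What that plot shows can be sketched in base R; the helper `kth_nn_dist` and the synthetic points `xy` are assumptions made for illustration, not the `dbscan::kNNdistplot()` implementation.

```r
# k-th nearest-neighbour distance for every row of a numeric matrix
kth_nn_dist <- function(x, k) {
  d <- as.matrix(dist(x))                       # full pairwise Euclidean distances
  apply(d, 1, function(row) sort(row)[k + 1])   # position 1 is the point itself (distance 0)
}

set.seed(1)  # arbitrary seed
# Synthetic stand-in: one dense group of 30 points plus two isolated points
xy <- rbind(matrix(rnorm(60, sd = 0.3), ncol = 2),
            c(5, 5), c(-6, 4))

knn3 <- sort(kth_nn_dist(xy, k = 3))
plot(knn3, type = "l", xlab = "Points sorted by distance",
     ylab = "3-NN distance")
# The sharp rise at the right end marks points that would become noise
# for any eps placed below it
```

Placing eps just below the sharp rise keeps the dense group together while leaving isolated points as noise, which is how cluster 0 arises.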

PAM

3 clusters

cluster_pam3<-eclust(eur, "pam", k= 3)

eur_pam3 <- data.frame(eur, cluster = as.factor(cluster_pam3$cluster))
pal <- colorFactor(palette = "Set1", domain = eur_pam3$cluster)
leaflet(data = eur_pam3) %>%
  addTiles() %>%
  addCircleMarkers( ~Longitude, ~Latitude, color = ~pal(cluster),
    radius = 7, fill = TRUE, fillOpacity = 0.5, stroke = FALSE, popup = ~paste("Cluster:", cluster)
  ) %>%
  addLegend(position = "topleft", pal = pal, values = eur_pam3$cluster, title = "PAM -3 clusters")

Cluster 1 covers ski resorts in Scandinavia and part of the UK. Cluster 2 includes the Carpathians, the Alps, the Caucasus, the Dinaric Alps, and other regions. Cluster 3 contains ski resorts near the Pyrenees and the Alps.

8 clusters

Here’s the division of ski resorts based on the PAM method:

1. Scandinavian ski resorts – Norway, Sweden, and Finland
2. Alps – Germany, Switzerland, Liechtenstein, Austria, Italy
3. Alps – Czech Republic, Austria, Slovenia, Italy
4. Alps – Mostly France and Switzerland
5. UK ski resorts
6. Ski resorts near the Pyrenees, also including resorts in France and Spain
7. Eastern Europe – Carpathians, Dinaric Alps – Mainly ski resorts in Poland, Lithuania, Bulgaria, Slovakia
8. Ski resorts in the Caucasus – Russia only

The 8 clusters with the PAM method are better represented on the map than the 8 clusters provided by the DBSCAN method, as there are no outliers.

Results

To sum up, I’ve performed clustering using three methods: K-MEANS, DBSCAN, and PAM. The 3-cluster solutions looked similar across methods; the biggest differences appeared with 8 clusters. The Alps region contains enough resorts that, regardless of the method used, it was always split into 2 or 3 of the 8 clusters. When clustering resorts based on their location, the PAM method gave the most convincing picture on the map.

The statistics of outputs of models, tested for both 3 and 8 clusters

round(calinhara(eur, cluster_km3$cluster),digits=2) #K-means with 3 clusters

## [1] 206.41

round(calinhara(eur, cluster_km8$cluster),digits=2) #K-means with 8 clusters

## [1] 468.93

round(calinhara(eur, db$cluster),digits=2) #DBscan with 8 clusters

## [1] 356.16

round(calinhara(eur, cluster_pam3$cluster),digits=2) #PAM with 3 clusters

## [1] 199.17

round(calinhara(eur, cluster_pam8$cluster),digits=2) #PAM with 8 clusters

## [1] 466.52

Calinski-Harabasz index: this statistic is usually used to compare solutions with different numbers of clusters; the higher the value, the better. By this index, K-means with 8 clusters scores highest (468.93), with PAM with 8 clusters almost identical (466.52). DBSCAN scores lower (356.16) but still well above both 3-cluster solutions (206.41 and 199.17). Overall, 8 clusters are a better solution than 3.
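
For reference, the index is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom. The sketch below computes it by hand on synthetic data; `ch_index` and `pts` are hypothetical names, and while this follows the usual definition, agreement with `fpc::calinhara()` is assumed rather than checked here.

```r
# Calinski-Harabasz index: (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))
ch_index <- function(x, cluster) {
  x <- as.matrix(x)
  n <- nrow(x)
  k <- length(unique(cluster))
  total_ss  <- sum(scale(x, scale = FALSE)^2)      # SS around the overall mean
  within_ss <- sum(sapply(split(as.data.frame(x), cluster), function(g) {
    sum(scale(as.matrix(g), scale = FALSE)^2)      # SS around each cluster mean
  }))
  between_ss <- total_ss - within_ss
  (between_ss / (k - 1)) / (within_ss / (n - k))
}

set.seed(1)  # arbitrary seed
# Synthetic stand-in: two blobs of 50 points each
pts <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2))
fit <- kmeans(pts, centers = 2, nstart = 10)

ch_index(pts, fit$cluster)
```

Because the denominator shrinks as clusters become tighter, the index tends to reward solutions with compact, well-separated groups, which is consistent with the 8-cluster solutions scoring higher above.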

Hierarchical clustering, also based on location

hc2 <- agnes(eur, method = "complete") # agglomerative clustering with agnes(), complete linkage
hc2$ac

## [1] 0.9916091

# agglomerative coefficient measures the amount of clustering structure found, values closer to 1 suggest strong clustering structure

# multiple methods to assess
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
ac <- function(x) {
  agnes(eur, method = x)$ac
}
map_dbl(m, ac) #Ward method has the highest score

##   average    single  complete      ward 
## 0.9882083 0.9786959 0.9916091 0.9963802

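# cutting the tree, Ward's method
d <- dist(eur, method = "euclidean") # dissimilarity matrix; its definition is not shown above, Euclidean distance is assumed
hc3 <- hclust(d, method = "ward.D2")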
# cut tree into 5 groups
sub_grp <- cutree(hc3, k = 5)
table(sub_grp)# number of members in each cluster

## sub_grp
##   1   2   3   4   5 
##  19 165 118  33  23

Going beyond 5 clusters only creates additional clusters with 4 or 5 observations, while solutions with 3 and 4 clusters give a similar number of observations per cluster. I think 5 clusters is the best choice for this dataset, and I’ll use Ward’s method.
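
Comparing cluster sizes across several cut levels, as described above, can be done in one pass. This is a sketch on synthetic points (`pts` is a made-up stand-in for the coordinates; the group sizes and seed are arbitrary):

```r
set.seed(1)  # arbitrary seed
# Synthetic stand-in for the coordinates: four groups of 40, 20, 10 and 5 points
pts <- rbind(matrix(rnorm(80, mean = 0,  sd = 0.4), ncol = 2),
             matrix(rnorm(40, mean = 4,  sd = 0.4), ncol = 2),
             matrix(rnorm(20, mean = 8,  sd = 0.4), ncol = 2),
             matrix(rnorm(10, mean = 12, sd = 0.4), ncol = 2))

hc <- hclust(dist(pts), method = "ward.D2")

# Cluster sizes for each candidate number of clusters
sizes_by_k <- lapply(2:7, function(k) as.vector(table(cutree(hc, k = k))))
names(sizes_by_k) <- paste0("k=", 2:7)
sizes_by_k
```

Once k exceeds the number of real groups, the extra cuts start splitting existing groups into small fragments, which is the pattern used above to stop at 5.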

Plots with borders

plot(hc3, cex = 0.6)
rect.hclust(hc3, k = 5, border = 2:5)

fviz_cluster(list(data = eur, cluster = sub_grp))

Hierarchical clustering presented on map

eur_5 <- data.frame(eur, cluster = as.factor(sub_grp))
pal <- colorFactor(palette = "Set1", domain = eur_5$cluster)
leaflet(data = eur_5) %>%
  addTiles() %>%
  addCircleMarkers( ~Longitude, ~Latitude, color = ~pal(cluster),
                    radius = 7, fill = TRUE, fillOpacity = 0.5, stroke = FALSE, popup = ~paste("Cluster:", cluster)
  ) %>%
  addLegend(position = "topright", pal = pal, values = eur_5$cluster, title = "Hierarchical clustering -5 clusters")

Hierarchical clustering with 5 clusters shows ski resorts located in Scandinavia, Eastern Europe, Western Europe, and two clusters in the Alps region.

Part 2 - clustering based on variables that describe ski resorts

Clustering results differ based on the number of clusters. I would like to explore these differences further and understand their origins. Specifically, I aim to investigate how the quality and facilities of ski resorts might impact the clustering outcomes.

I’ll take into consideration: ski pass price, total number of slopes and their length, total number of lifts and their capacity. Number of snow cannons, child friendliness and possibility of skiing at night will also be included.

eur_sample<-Europe[c("Price","Total.slopes","Lift.capacity","Longest.run","Snow.cannons","Total.lifts","Child.friendly","Nightskiing")]
eur_sample <- subset(eur_sample, Price!= 0)
eur_sample <- subset(eur_sample, Lift.capacity!= 0)
str(eur_sample)

## tibble [354 × 8] (S3: tbl_df/tbl/data.frame)
##  $ Price         : num [1:354] 46 44 48 45 43 22 20 35 81 54 ...
##  $ Total.slopes  : num [1:354] 43 34 26 44 40 4 8 34 322 175 ...
##  $ Lift.capacity : num [1:354] 22921 14225 16240 21060 11900 ...
##  $ Longest.run   : num [1:354] 6 2 9 6 0 0 6 3 16 10 ...
##  $ Snow.cannons  : num [1:354] 325 100 123 150 40 0 0 0 1060 630 ...
##  $ Total.lifts   : num [1:354] 21 24 11 18 11 4 4 21 63 84 ...
##  $ Child.friendly: num [1:354] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Nightskiing   : num [1:354] 1 1 0 1 0 0 1 1 0 0 ...

summary(eur_sample)

##      Price       Total.slopes    Lift.capacity     Longest.run   
##  Min.   :17.0   Min.   :  1.00   Min.   :   900   Min.   : 0.00  
##  1st Qu.:34.0   1st Qu.: 26.00   1st Qu.: 11907   1st Qu.: 0.00  
##  Median :43.0   Median : 50.00   Median : 19760   Median : 3.00  
##  Mean   :41.9   Mean   : 90.14   Mean   : 36449   Mean   : 3.76  
##  3rd Qu.:49.0   3rd Qu.:100.00   3rd Qu.: 38400   3rd Qu.: 7.00  
##  Max.   :81.0   Max.   :600.00   Max.   :252280   Max.   :16.00  
##   Snow.cannons     Total.lifts     Child.friendly    Nightskiing    
##  Min.   :   0.0   Min.   :  1.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:   0.0   1st Qu.: 11.00   1st Qu.:1.0000   1st Qu.:0.0000  
##  Median :  65.5   Median : 17.00   Median :1.0000   Median :0.0000  
##  Mean   : 230.2   Mean   : 28.21   Mean   :0.9972   Mean   :0.4181  
##  3rd Qu.: 265.2   3rd Qu.: 30.00   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :2383.0   Max.   :174.00   Max.   :1.0000   Max.   :1.0000

hopkins(eur_sample, nrow(eur_sample) -1)

## [1] 0.9999989

Rejecting the null hypothesis that the dataset is uniformly distributed. The Hopkins statistic is very close to 1, indicating that our data is highly clusterable.

Looking for optimal number of clusters:

opt1<-Optimal_Clusters_KMeans(eur_sample, max_clusters=10, plot_clusters = TRUE) #elbow method suggests 2 clusters

opt2<-Optimal_Clusters_KMeans(eur_sample, max_clusters=10, plot_clusters=TRUE, criterion="silhouette") #according to silhouette optimal number of clusters is 2, however 3 clusters would also work out

fviz_nbclust(eur_sample, kmeans, method = "wss")  #2 clusters  suggested

Chosen number of clusters: 2

Average silhouette width = 0.71, which indicates that the clustering structure is strong and well-defined.
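
To make the silhouette figure easier to interpret, here is a hand-rolled version of the average silhouette width on synthetic data. The function `avg_silhouette` and the points `pts` are illustrative assumptions; in practice `cluster::silhouette()` would normally be used.

```r
# Average silhouette width from the data and a cluster assignment
avg_silhouette <- function(x, cluster) {
  d <- as.matrix(dist(x))
  ids <- sort(unique(cluster))
  s <- sapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)                  # convention for singleton clusters
    a <- sum(d[i, own]) / (sum(own) - 1)          # mean distance to own cluster (self excluded)
    b <- min(sapply(ids[ids != cluster[i]],       # mean distance to the nearest other cluster
                    function(g) mean(d[i, cluster == g])))
    (b - a) / max(a, b)
  })
  mean(s)
}

set.seed(1)  # arbitrary seed
# Synthetic stand-in: two well-separated blobs
pts <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2))
fit <- kmeans(pts, centers = 2, nstart = 10)

avg_silhouette(pts, fit$cluster)
```

Values near 1 mean points sit much closer to their own cluster than to the next one; values near 0 mean the clusters overlap.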

Clustering based on variables that describe ski resorts will be performed using the K-MEANS, PAM, CLARA and DBSCAN methods.

K-MEANS

cluster_km <- kmeans(eur_sample, 2)
cluster_km$centers

##      Price Total.slopes Lift.capacity Longest.run Snow.cannons Total.lifts
## 1 40.09061     53.99676      21791.44    3.553398      124.343    18.35599
## 2 54.35556    338.35556     137101.38    5.177778      957.200    95.88889
##   Child.friendly Nightskiing
## 1      0.9967638   0.3786408
## 2      1.0000000   0.6888889

The first cluster contains cheaper ski resorts with lower-quality slopes and facilities. Both clusters appear to be almost identical in terms of child-friendliness.

The second cluster consists of ski resorts with higher ski pass prices, a greater number of total slopes and lifts, larger lift capacities, and longer runs. It also has more snow cannons and more options for night skiing.
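
The centers above show that the variables live on very different scales (Lift.capacity in the tens of thousands, Nightskiing between 0 and 1), so Euclidean distances on the raw data are driven mostly by Lift.capacity. Whether to standardise is a judgment call that the analysis here does not make; the following sketch, on made-up numbers, only illustrates how `scale()` changes each variable's contribution to a distance. The data frame `toy` and helper `sq_share` are hypothetical.

```r
# Four hypothetical resorts described by three of the variables (values are made up)
toy <- data.frame(
  Price         = c(40, 54, 45, 60),
  Lift.capacity = c(21000, 137000, 18000, 90000),
  Nightskiing   = c(0, 1, 1, 0)
)

# Share of each variable in the squared Euclidean distance between rows 1 and 2
sq_share <- function(m) {
  diff2 <- (m[1, ] - m[2, ])^2
  round(diff2 / sum(diff2), 4)
}

raw_share    <- sq_share(as.matrix(toy))
scaled_share <- sq_share(scale(toy))   # columns centred and divided by their standard deviation

rbind(raw = raw_share, scaled = scaled_share)
```

On the raw values almost the entire distance comes from Lift.capacity; after scaling, all three variables contribute.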

PAM

cluster_pam<-eclust(eur_sample, "pam", k= 2)

cluster_pam$medoids

##      Price Total.slopes Lift.capacity Longest.run Snow.cannons Total.lifts
## [1,]    50           60         16381           8           75          15
## [2,]    81          322         93464           0         1060          63
##      Child.friendly Nightskiing
## [1,]              1           1
## [2,]              1           0

The results are similar to the output of the K-MEANS method. One difference: the PAM medoid of the more expensive cluster has no night skiing and a longest run of 0, whereas the K-MEANS centroid of the expensive cluster had higher values for both than the cheaper cluster.
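
Part of this difference comes from what the two summaries are: a centroid is a vector of column means, while a medoid is one actual observation, so a 0/1 flag is averaged to a fraction in the former and is exactly 0 or 1 in the latter. A small sketch on made-up values (`grp` is hypothetical):

```r
# Made-up cluster of five resorts: longest run (km) and a 0/1 night-skiing flag
grp <- data.frame(
  Longest.run = c(0, 4, 6, 7, 9),
  Nightskiing = c(0, 1, 1, 0, 1)
)

# Centroid: column means, generally not an actual observation
centroid <- colMeans(grp)

# Medoid: the observation with the smallest total distance to all others
d <- as.matrix(dist(grp))
medoid <- grp[which.min(rowSums(d)), ]

centroid
medoid
```

A medoid with Nightskiing = 0 therefore says only that the most central resort lacks night skiing, not that the cluster as a whole does.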

CLARA

cluster_clara<-eclust(eur_sample, "clara", k=2)

cluster_clara

## Call:     fun_clust(x = x, k = k) 
## Medoids:
##      Price Total.slopes Lift.capacity Longest.run Snow.cannons Total.lifts
## [1,]    30           65         17948           0          100          22
## [2,]    55          179         92510           8         1102          54
##      Child.friendly Nightskiing
## [1,]              1           0
## [2,]              1           1
## Objective function:   15191.25
## Clustering vector:    int [1:354] 1 1 1 1 1 1 1 1 2 2 1 2 1 1 1 1 1 1 ...
## Cluster sizes:            293 61 
## Best sample:
##  [1]   2   9  33  54  57  62  67  68  74  80  86  93 109 111 118 119 121 135 138
## [20] 142 149 153 155 157 164 179 180 187 194 226 260 262 265 270 275 278 284 306
## [39] 311 313 336 338 343 352
## 
## Available components:
##  [1] "sample"     "medoids"    "i.med"      "clustering" "objective" 
##  [6] "clusinfo"   "diss"       "call"       "silinfo"    "data"      
## [11] "clust_plot" "nbclust"

The CLARA results show cluster differences quite similar to those from K-MEANS.

DBSCAN

## dbscan Pts=354 MinPts=9 eps=1500
##         0   1  2
## border 69   6  8
## seed    0 269  2
## total  69 275 10

According to the DBSCAN method, cheaper ski resorts have fewer and shorter slopes, fewer snow cannons, and fewer ski lifts with lower capacity. The cheaper cluster also has a lower share of child-friendly resorts and a poorer night-skiing offer than the more expensive one. Again, we can conclude that the price of the ski pass goes hand in hand with the quality of the resort and its facilities.
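
The code behind this per-cluster comparison is not shown in this rendering. One way such a profile could be produced is sketched below on made-up data; `toy` and its values are hypothetical, and dropping the noise label 0 before averaging is an assumption about how the comparison was made.

```r
# Made-up data with a DBSCAN-style label (0 = noise)
toy <- data.frame(
  cluster      = c(1, 1, 1, 2, 2, 0),
  Price        = c(35, 40, 38, 60, 64, 20),
  Total.slopes = c(30, 45, 50, 300, 340, 5)
)

# Drop noise points, then average every variable within each cluster
profile <- aggregate(. ~ cluster, data = subset(toy, cluster != 0), FUN = mean)
profile
```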

Results of K-MEANS, PAM, CLARA and DBSCAN

The differences between the clusters stem primarily from the ski pass price, the number of slopes, and their length. The number of lifts and their capacity are also important factors, as well as the availability of snow cannons.

In general, the “cheaper” cluster across all methods tends to have fewer slopes, a lower number of snow cannons, fewer lifts, and lower lift capacities. We can assume that a higher price is associated with better slopes and facilities. However, child-friendly resorts and night skiing options do not contribute significantly to the clustering results.

CONCLUSIONS

To sum up, ski resort data was clustered using different methods. I performed clustering based on location as well as on variables describing the ski resorts. As part of the preprocessing, I selected only European ski resorts because the clustering methods presented do not work perfectly with the curvature of the Earth, and working with a smaller region reduces the possibility of incorrect calculations, errors or outliers.

Clustering based on location resulted in varying numbers of optimal clusters, with this number changing depending on the clustering method used. However, when it came to the features and characteristics of the ski resorts, all methods showed broadly the same results.

Clustering based on location primarily placed clusters around the same mountain regions, with the Alps showing 2 or 3 clusters due to the larger number of ski resorts in that area. Meanwhile, clustering based on features revealed two clusters that differed in quality and service price. This type of data is definitely easier to cluster than location-based data.

In conclusion, ski resorts are more differentiated based on their location than based on their features.

Clustering on ski resorts

Izabela Kosiec 441755

2025-01-01
