Japan Hostel Clustering Using the K-Means Algorithm

Objective

The objective of this model is to cluster hostels in Japan based on several features, including:

Starting price (JPY)
Distance from city centre
Rating score
Rating band (categorical; present in the data but dropped before modelling)
Atmosphere score
Cleanliness score
Facilities score
Location score
Security score
Staff score
Value for money score

The model can also serve as a simple recommendation system, for example when someone wants to find a similar hostel in a different city.

Data Source

The data comes from the Hostel World dataset on Kaggle: https://www.kaggle.com/koki25ando/hostel-world-dataset

Library

Load the required libraries.

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(ggplot2)

Data Preprocessing

Data Input

db <- read.csv("Hostel.csv")
head(db)

##   X                  hostel.name         City price.from
## 1 1 "Bike & Bed" CharinCo Hostel        Osaka       3300
## 2 2                 & And Hostel Fukuoka-City       2600
## 3 3        &And Hostel Akihabara        Tokyo       3600
## 4 4             &And Hostel Ueno        Tokyo       2600
## 5 5   &And Hostel-Asakusa North-        Tokyo       1500
## 6 6       1night1980hostel Tokyo        Tokyo       2100
##   Distance.from.city.centre summary.score rating.band atmosphere cleanliness
## 1                       2.9           9.2      Superb        8.9         9.4
## 2                       0.7           9.5      Superb        9.4         9.7
## 3                       7.8           8.7    Fabulous        8.0         7.0
## 4                       8.7           7.4   Very Good        8.0         7.5
## 5                      10.0           9.4      Superb        9.5         9.5
## 6                       9.4           7.0   Very Good        5.5         8.0
##   facilities location.y security staff valueformoney      lon      lat
## 1        9.3        8.9      9.0   9.4           9.4 135.5138 34.68268
## 2        9.5        9.7      9.2   9.7           9.5       NA       NA
## 3        9.0        8.0     10.0  10.0           9.0 139.7775 35.69745
## 4        7.5        7.5      7.0   8.0           6.5 139.7837 35.71272
## 5        9.0        9.0      9.5  10.0           9.5 139.7984 35.72790
## 6        6.0        6.0      8.5   8.5           6.5 139.7869 35.72438

Check Missing Values

colSums(is.na(db))

##                         X               hostel.name                      City 
##                         0                         0                         0 
##                price.from Distance.from.city.centre             summary.score 
##                         0                         0                        15 
##               rating.band                atmosphere               cleanliness 
##                        15                        15                        15 
##                facilities                location.y                  security 
##                        15                        15                        15 
##                     staff             valueformoney                       lon 
##                        15                        15                        44 
##                       lat 
##                        44

Remove Missing Values

db <- db %>%
  drop_na()
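
Note that drop_na() with no arguments removes a row if any column is missing, including lon and lat, which are not used in the model. The sketch below illustrates the difference on a small made-up data frame (the values in `toy` are hypothetical, not taken from the dataset); restricting drop_na() to specific columns is shown only as an option, not as what the analysis above does.

```r
library(tidyr)

# Hypothetical rows: one complete, one missing a score, one missing a coordinate
toy <- data.frame(
  hostel.name = c("A", "B", "C"),
  summary.score = c(9.2, NA, 8.7),
  lon = c(135.5, 130.4, NA)
)

# drop_na() with no columns: keep only rows with no missing value anywhere
toy_complete <- drop_na(toy)

# drop_na() with a column: only rows missing that column are removed
toy_scored <- drop_na(toy, summary.score)

nrow(toy)           # rows before
nrow(toy_complete)  # rows after dropping any NA
nrow(toy_scored)    # rows after dropping only missing scores
```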

Select Columns for Modelling

db_model <- db %>%
  select(-c(X, hostel.name, City, rating.band, lon, lat))

Scaling

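Because the features are on very different ranges (prices are in the thousands of JPY, while the scores go up to 10), scaling is required so that no single feature dominates the distance calculation.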

db_model <- db_model %>%
  scale()
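
As a quick check of what scale() does, the sketch below standardises a small made-up data frame (the values in `toy` are illustrative only) and confirms that each column ends up with mean 0 and standard deviation 1.

```r
# Toy data (hypothetical values) with columns on very different ranges
toy <- data.frame(
  price.from = c(1500, 2600, 3300, 3600),
  summary.score = c(9.4, 7.4, 9.2, 8.7)
)

# scale() centres each column on its mean and divides by its standard deviation
toy_scaled <- scale(toy)

# After scaling, every column has mean ~0 and standard deviation 1
col_means <- colMeans(toy_scaled)
col_sds <- apply(toy_scaled, 2, sd)
print(round(col_means, 10))
print(col_sds)
```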

Modelling

Determine Number of Clusters

fviz_nbclust(x = db_model, FUNcluster = kmeans, method = "wss")

Based on the WSS (elbow) chart, the chosen number of clusters is 7.
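
The WSS chart plots the total within-cluster sum of squares against the number of clusters. A minimal sketch of the same quantity computed by hand is given below, assuming a synthetic matrix `toy` as a stand-in for the scaled features; `nstart = 10` is an illustrative choice, not a setting taken from the analysis.

```r
set.seed(100)

# Synthetic stand-in for the scaled feature matrix (assumption: 60 rows, 3 features)
toy <- scale(matrix(rnorm(180), ncol = 3))

# Total within-cluster sum of squares for k = 1..10
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is where adding another cluster stops reducing WSS by much
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

On random data like this there is no clear elbow; on real data the point where the curve flattens is a judgment call, so it is worth comparing a few neighbouring values of k.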

K-Means Clustering

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(100)
hostel_cluster <- kmeans(x = db_model, centers = 7)
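
The object returned by kmeans() holds the cluster label for each row, the centroids, and the cluster sizes. The sketch below inspects those fields on synthetic data (the matrix `toy` and the choice of 3 clusters are assumptions for illustration). It also passes `nstart = 25`, which the call above does not use; this runs several random initialisations and keeps the best one, which tends to make the result less sensitive to the seed.

```r
set.seed(100)

# Synthetic stand-in for the scaled hostel features (hypothetical data)
toy <- scale(matrix(rnorm(200), ncol = 4))

# nstart = 25 tries 25 random initialisations and keeps the best one;
# the default is nstart = 1
toy_cluster <- kmeans(x = toy, centers = 3, nstart = 25)

toy_cluster$size           # number of rows in each cluster
toy_cluster$centers        # cluster centroids, in scaled units
head(toy_cluster$cluster)  # cluster label for each row
```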

Result

db_cluster <- db %>%
  mutate(cluster = hostel_cluster$cluster)

Visualisation

head(db_cluster)

##   X                  hostel.name  City price.from Distance.from.city.centre
## 1 1 "Bike & Bed" CharinCo Hostel Osaka       3300                       2.9
## 2 3        &And Hostel Akihabara Tokyo       3600                       7.8
## 3 4             &And Hostel Ueno Tokyo       2600                       8.7
## 4 5   &And Hostel-Asakusa North- Tokyo       1500                      10.0
## 5 6       1night1980hostel Tokyo Tokyo       2100                       9.4
## 6 7          328 Hostel & Lounge Tokyo       3300                      16.0
##   summary.score rating.band atmosphere cleanliness facilities location.y
## 1           9.2      Superb        8.9         9.4        9.3        8.9
## 2           8.7    Fabulous        8.0         7.0        9.0        8.0
## 3           7.4   Very Good        8.0         7.5        7.5        7.5
## 4           9.4      Superb        9.5         9.5        9.0        9.0
## 5           7.0   Very Good        5.5         8.0        6.0        6.0
## 6           9.3      Superb        8.7         9.7        9.3        9.1
##   security staff valueformoney      lon      lat cluster
## 1      9.0   9.4           9.4 135.5138 34.68268       7
## 2     10.0  10.0           9.0 139.7775 35.69745       4
## 3      7.0   8.0           6.5 139.7837 35.71272       1
## 4      9.5  10.0           9.5 139.7984 35.72790       4
## 5      8.5   8.5           6.5 139.7869 35.72438       1
## 6      9.3   9.7           8.9 139.7455 35.54804       4

db_cluster <- db_cluster %>%
  mutate(City = as.factor(City),
         cluster = as.factor(cluster))

## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
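
The code that builds db_fukuoka, db_hiroshima, db_kyoto, db_osaka and db_tokyo is not shown here; the summarise() messages above suggest each one is a grouped summary. A minimal sketch of one way to build such a table follows, assuming each holds the number of hostels per cluster in a `count` column. The helper `count_by_cluster` and the toy rows are hypothetical.

```r
library(dplyr)

# Toy stand-in for db_cluster (hypothetical rows)
toy_cluster <- data.frame(
  City = c("Tokyo", "Tokyo", "Tokyo", "Osaka", "Osaka", "Kyoto"),
  cluster = factor(c(1, 4, 4, 7, 1, 4))
)

# Hypothetical helper: number of hostels per cluster within one city
count_by_cluster <- function(data, city) {
  data %>%
    filter(City == city) %>%
    group_by(cluster) %>%
    summarise(count = n(), .groups = "drop")
}

toy_tokyo <- count_by_cluster(toy_cluster, "Tokyo")
toy_tokyo
```

When applying this to the real data, the city label must match the City column exactly; the head() output earlier shows Fukuoka recorded as "Fukuoka-City".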

Cluster Distribution based on Location

Hostel Clusters in Fukuoka

ggplot(db_fukuoka, aes(x = cluster, y = count))+
  geom_col(aes(fill = cluster))

Hostel Clusters in Hiroshima

ggplot(db_hiroshima, aes(x = cluster, y = count))+
  geom_col(aes(fill = cluster))

Hostel Clusters in Kyoto

ggplot(db_kyoto, aes(x = cluster, y = count))+
  geom_col(aes(fill = cluster))

Hostel Clusters in Osaka

ggplot(db_osaka, aes(x = cluster, y = count))+
  geom_col(aes(fill = cluster))

Hostel Clusters in Tokyo

ggplot(db_tokyo, aes(x = cluster, y = count))+
  geom_col(aes(fill = cluster))

Mean Price by Cluster

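# Aggregate first: mean(price.from) inside aes() would give the overall mean
# for every row, and geom_col() would then stack those rows per cluster
db_cluster %>%
  mutate(cluster = as.factor(cluster)) %>%
  group_by(cluster) %>%
  summarise(mean_price = mean(price.from)) %>%
  ggplot(aes(x = cluster, y = mean_price)) +
  geom_col(aes(fill = cluster))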

Rating Distribution Based on Cluster

# Mean of each rating score within each cluster
rating <- db_cluster %>%
  select(summary.score, atmosphere, cleanliness, facilities, location.y,
         security, staff, valueformoney, cluster) %>%
  group_by(cluster) %>%
  summarise_all(mean)

# One panel per rating type, one bar per cluster
rating %>%
  pivot_longer(cols = -cluster, names_to = "type", values_to = "value") %>%
  ggplot(aes(x = as.factor(cluster), y = value)) +
  geom_col(aes(fill = cluster)) +
  facet_wrap(~type)

Closing

Seven clusters are formed using the K-Means method. Each cluster has its own characteristics in terms of ratings, price, and location. The model could be used as a recommendation system: for example, if a person usually stays at a cluster 1 hostel in Tokyo, then when they want to stay in Osaka the system would recommend the cluster 1 hostels located in Osaka, since those hostels have characteristics similar to the one they stayed at before.
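
The recommendation idea can be sketched as a simple lookup: find the cluster of a hostel the person liked, then list hostels from the same cluster in the target city. The helper `recommend_hostel` and the toy data below are hypothetical and only illustrate the idea; they are not part of the analysis above.

```r
library(dplyr)

# Toy stand-in for db_cluster (hypothetical hostels and cluster labels)
toy_cluster <- data.frame(
  hostel.name = c("Hostel A", "Hostel B", "Hostel C", "Hostel D"),
  City = c("Tokyo", "Osaka", "Osaka", "Kyoto"),
  cluster = factor(c(1, 1, 4, 1))
)

# Hypothetical helper: hostels in target_city that share a cluster with liked_hostel
recommend_hostel <- function(data, liked_hostel, target_city) {
  liked_cluster <- data %>%
    filter(hostel.name == liked_hostel) %>%
    pull(cluster)

  data %>%
    filter(City == target_city,
           cluster %in% liked_cluster,
           hostel.name != liked_hostel)
}

recs <- recommend_hostel(toy_cluster, "Hostel A", "Osaka")
recs
```

A lookup like this treats every hostel in a cluster as equally similar; ranking the candidates by distance to the liked hostel in the scaled feature space would be one possible refinement.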