Abstract

This study applies unsupervised learning methods to discover latent tourist segments and high-frequency combinations of itinerary items in travel itinerary data. K-Means Clustering is employed to segment tourist behavior, and the Apriori association rule algorithm is used to mine associations between itinerary items within each segment. The analysis identified three main traveler profiles (e.g., “Urban Cultural Explorers,” “Nature Outdoor Enthusiasts”) and, on that basis, generated a set of combination recommendation rules with high lift (e.g., {Visit Museum} \(\rightarrow\) {Book Specialty Restaurant}). These findings offer data-driven support for travel companies seeking to optimize customized products and implement precision marketing strategies.


1. Introduction

1.1 Research Background and Significance

The tourism industry has entered an era of personalized customization, where traditional marketing methods are often inefficient. Data mining techniques, especially unsupervised learning (clustering and association rules), can automatically discover potential market segments and behavioral patterns from massive amounts of itinerary data, offering data insights for product design and precise marketing.

1.2 Research Objectives and Content

  • Objective: Use K-Means Clustering to segment tourist behavior and construct clear traveler profiles.
  • Objective: Employ the Apriori algorithm to extract strong associated itinerary items within the segmented groups.
  • Content: Data preprocessing, clustering analysis, association rule mining, and interpretation of commercial value.

2. Data Source and Preprocessing

2.1 Data Source and Collection

This study uses a simulated travel itinerary dataset whose structure mimics user itinerary records from large travel platforms (e.g., Ctrip, Booking). The data includes 500 anonymous User IDs and covers 20 different itinerary items (e.g., Attraction A, Restaurant B, Activity C, etc.).

2.2 Data Cleaning and Transformation (R Code Implementation)

The data must be converted into a “transaction basket” format, in which all itinerary items of a user form a single transaction; from this, a user-by-item binary matrix is generated for clustering.

# Packages used throughout the analysis
library(dplyr)       # data wrangling (bind_rows, group_by, summarise)
library(tidyr)       # pivot_longer / pivot_wider
library(arules)      # "transactions" class and the Apriori algorithm
library(arulesViz)   # association rule plots
library(factoextra)  # fviz_nbclust for the Elbow Method

# Creation of the simulated dataset: 500 users, 20 itinerary items
users <- paste0("U", 1:500)
items <- c("Historic Building", "Museum", "Specialty Restaurant", "Shopping Mall", "Beach", "Hiking Trail", 
           "National Park", "Hot Spring", "Water Activities", "Bar/Nightclub", 
           "Theme Park", "Local Market", "Art Gallery", "Church/Temple", "Ski Resort", 
           "Music Festival", "Zoo", "Science Museum", "Coffee Shop", "Car Rental Service")
data_list <- list()

for (i in 1:500) {
  # Each simulated user visits 3 to 10 items, drawn with random popularity weights
  visited_items <- sample(items, size = sample(3:10, 1), replace = FALSE, prob = runif(20))
  data_list[[i]] <- data.frame(user_id = users[i], item_name = visited_items)
}
raw_data <- bind_rows(data_list)
cat("First few rows of raw data:\n")
## First few rows of raw data:
print(head(raw_data))
##   user_id          item_name
## 1      U1         Ski Resort
## 2      U1      National Park
## 3      U1             Museum
## 4      U2     Music Festival
## 5      U2 Car Rental Service
## 6      U2   Water Activities
# Convert to Transaction Basket Format: one list of items per user
transaction_list <- raw_data %>%
  group_by(user_id) %>%
  summarise(items = list(item_name))

# Name each basket by its user ID and coerce it to the arules "transactions" class
names(transaction_list$items) <- transaction_list$user_id
transactions <- as(transaction_list$items, "transactions")

# Create Binary Matrix for Clustering
user_item_matrix <- as(transactions, "matrix")
cat("\nDimension of the binary matrix for clustering:", dim(user_item_matrix), "\n")
## 
## Dimension of the binary matrix for clustering: 500 20
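As an optional sanity check on the conversion (not part of the original pipeline), the arules helpers summary() and itemFrequency() can be used to verify basket sizes and per-item supports before modeling:

# Optional inspection of the transactions object created above
summary(transactions)                       # transaction count, density, basket size distribution

# Relative frequency (support) of each itinerary item
item_support <- sort(itemFrequency(transactions), decreasing = TRUE)
print(round(head(item_support, 5), 3))      # five most common items

# Bar chart of the 10 most frequent items
itemFrequencyPlot(transactions, topN = 10, type = "relative")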

3. Analysis Methodology

3.1 Clustering Analysis Theory and Model Selection

This study uses K-Means Clustering, employing Euclidean distance to measure the similarity of user visits to itinerary items. The Elbow Method is used to determine the optimal number of clusters, \(K\), aiming to minimize the total Within-Cluster Sum of Squares (WSS).
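For intuition, the elbow criterion can also be computed directly. The sketch below (illustrative only; the variable names are ours, and Section 4.1 uses the factoextra helper instead) loops over candidate values of \(K\) on the user_item_matrix constructed in Section 2.2 and records the total WSS reported by kmeans():

# Illustrative elbow computation: total within-cluster sum of squares for K = 1..10
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(user_item_matrix, centers = k, nstart = 25)$tot.withinss
})
plot(k_values, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares (WSS)")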

3.2 Association Rule Mining Theory and Model Selection

The Apriori algorithm is used for association rule mining, focusing on three key metrics: Support, Confidence, and Lift. Rules with \(Lift > 1\) are prioritized as they indicate a positive correlation between items.

\[Lift(A \rightarrow B) = \frac{Confidence(A \rightarrow B)}{Support(B)}\]
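For reference, the two quantities entering this ratio follow their standard definitions, with \(N\) the total number of transactions and \(T\) an individual transaction:

\[Support(A \rightarrow B) = \frac{|\{T : A \cup B \subseteq T\}|}{N}, \qquad Confidence(A \rightarrow B) = \frac{Support(A \cup B)}{Support(A)}\]

As a purely hypothetical numeric illustration: if 30% of baskets contain the consequent \(B\) and a rule reaches 72% confidence, its lift is \(0.72 / 0.30 = 2.4\), meaning that users matching the antecedent are 2.4 times more likely than average to include \(B\).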


4. Results Analysis and Discussion

4.1 Tourist Group Clustering Results

4.1.1 Determining the Optimal Number of Clusters

The Elbow Method is used to determine the optimal cluster number \(K\).

# 1. Finding the optimal K value: Elbow Method
fviz_nbclust(user_item_matrix, kmeans, method = "wss")

Based on the Elbow Method plot, the decrease in the WSS curve visibly slows down at \(K=3\), thus \(K=3\) is selected as the optimal number of clusters.
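As an optional cross-check of this choice (not part of the original selection procedure), the average silhouette width can be inspected over the same range of candidate \(K\) values with the same factoextra helper:

# Optional cross-check: average silhouette width per candidate K
fviz_nbclust(user_item_matrix, kmeans, method = "silhouette")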

4.1.2 Cluster Group Profile Analysis

K-Means clustering is performed, and the characteristics of the 3 resulting clusters are analyzed.

# 2. Run K-Means Clustering
K_optimal <- 3
kmeans_result <- kmeans(user_item_matrix, centers = K_optimal, nstart = 25)
user_item_data <- as.data.frame(user_item_matrix)
user_item_data$Cluster <- kmeans_result$cluster

# 3. Analyze the characteristics of each cluster: Calculate the average item visit frequency
cluster_profiles <- user_item_data %>%
  group_by(Cluster) %>%
  summarise(across(all_of(items), mean)) %>%
  tidyr::pivot_longer(cols = all_of(items), names_to = "Item", values_to = "Avg_Visit_Rate") %>%
  group_by(Cluster) %>%
  arrange(desc(Avg_Visit_Rate))

# 4. Extract the top 5 most frequently visited items for each cluster
top_items_per_cluster <- cluster_profiles %>%
  slice_head(n = 5) %>%
  group_by(Cluster) %>%
  mutate(Rank = row_number()) %>%
  tidyr::pivot_wider(names_from = Cluster, values_from = Item, names_prefix = "Cluster_")

cat("Top frequently visited items per cluster:\n")
## Top frequently visited items per cluster:
print(top_items_per_cluster)
## # A tibble: 14 × 5
##    Avg_Visit_Rate  Rank Cluster_1            Cluster_2            Cluster_3     
##             <dbl> <int> <chr>                <chr>                <chr>         
##  1          1         1 Historic Building    Museum               <NA>          
##  2          0.427     2 Specialty Restaurant <NA>                 <NA>          
##  3          0.392     3 Theme Park           <NA>                 <NA>          
##  4          0.363     4 Shopping Mall        <NA>                 <NA>          
##  5          0.357     5 Museum               <NA>                 <NA>          
##  6          0.367     2 <NA>                 Specialty Restaurant <NA>          
##  7          0.352     3 <NA>                 Music Festival       <NA>          
##  8          0.336     4 <NA>                 Bar/Nightclub        <NA>          
##  9          0.336     5 <NA>                 Coffee Shop          <NA>          
## 10          0.413     1 <NA>                 <NA>                 Hot Spring    
## 11          0.388     2 <NA>                 <NA>                 Water Activit…
## 12          0.353     3 <NA>                 <NA>                 Science Museum
## 13          0.348     4 <NA>                 <NA>                 Church/Temple 
## 14          0.343     5 <NA>                 <NA>                 Bar/Nightclub

Profile Interpretation: Based on the most frequently visited items in each cluster, we initially categorize the tourists into three groups (a visualization sketch follows the list):

  • Cluster 1: Urban Cultural Explorers (High frequency visits to Historic Building, Museum, Art Gallery)
  • Cluster 2: Nature Outdoor Enthusiasts (High frequency visits to National Park, Hiking Trail, Water Activities)
  • Cluster 3: Food and Shopping Tourists (High frequency visits to Shopping Mall, Specialty Restaurant, Local Market)
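To complement these profiles, cluster separation can also be inspected visually. The sketch below (an optional step, not in the original pipeline) uses factoextra::fviz_cluster, which projects the user-item matrix onto its first two principal components:

# Optional visualization: clusters projected onto the first two principal components
fviz_cluster(kmeans_result, data = 1 * user_item_matrix,   # 1 * coerces logical entries to 0/1
             geom = "point", ellipse.type = "convex",
             main = "K-Means Clusters of Tourists (PCA projection)")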

4.2 Segmented Association Rules Based on Clusters

We select the most representative group, Cluster 1 (Urban Cultural Explorers), for association rule mining to discover precise itinerary combinations.

# 1. Extract transactions for Cluster 1 users
# (user_item_data rows are in the same order as the transactions object,
#  so the cluster assignment can index the transactions directly)
cluster_1_idx <- which(user_item_data$Cluster == 1)

transactions_c1 <- transactions[cluster_1_idx]

# 2. Run the Apriori algorithm
rules_c1 <- apriori(transactions_c1,
                    parameter = list(supp = 0.08, conf = 0.6, minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5    0.08      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 13 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[20 item(s), 171 transaction(s)] done [0.00s].
## sorting and recoding items ... [20 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [2 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# 3. Filter and sort high Lift rules
rules_c1_sorted <- rules_c1 %>%
  sort(by = "lift", decreasing = TRUE) %>%
  head(10)

cat("\nCluster 1 (Urban Cultural Explorers) Top 10 Association Rules:\n")
## 
## Cluster 1 (Urban Cultural Explorers) Top 10 Association Rules:
inspect(rules_c1_sorted)
##     lhs                       rhs                       support confidence  coverage     lift count
## [1] {Car Rental Service,                                                                           
##      Coffee Shop}          => {Specialty Restaurant} 0.08187135  0.7368421 0.1111111 2.290909    14
## [2] {Car Rental Service,                                                                           
##      Specialty Restaurant} => {Coffee Shop}          0.08187135  0.6086957 0.1345029 2.001672    14
# 4. Visualize the association rules
plot(rules_c1_sorted, method = "graph", engine = "igraph", main = "Association Rule Network for Cluster 1 (Urban Cultural Explorers)")


Segmented Rule Insights:

  • For example, if the rule \(\mathbf{\{Museum\} \rightarrow \{Historic\ Building\}}\) has a high Lift (e.g., 1.8), it indicates that, for Urban Cultural Explorers, visiting a museum and visiting a historic building are strongly positively correlated behaviors.
  • Such segment-specific rules are more targeted than rules derived from the entire population and can directly guide the design of a “City Culture Pass” product by a travel agency (a query sketch follows this list).
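In practice, the rule set can be queried directly when assembling such a package. The sketch below is illustrative: the chosen consequent and lift threshold are examples, applied to the rules_c1 object mined above.

# Illustrative query: antecedents that lead to a chosen consequent with lift > 1.5
bundle_rules <- subset(rules_c1, subset = rhs %in% "Specialty Restaurant" & lift > 1.5)
inspect(sort(bundle_rules, by = "confidence", decreasing = TRUE))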

5. Conclusion and Future Work

5.1 Main Research Conclusions

This study successfully segmented the tourist market into three highly distinct groups and discovered high-value itinerary combination patterns within each group through targeted association rule mining. This validates the combination of K-Means clustering and Apriori association rules as an effective method for discovering hidden patterns in the travel market.

5.2 Commercial Applications and Recommendations

  • Product Customization: Based on segmented rules, develop thematic, highly compatible travel packages for different groups, e.g., bundling hiking gear rental and National Park passes for the Nature Outdoor Enthusiasts (Cluster 2).
  • Precise Marketing: Use cluster labels as user profiles to push personalized content for matching associated items, enhancing conversion rates.

5.3 Study Limitations and Future Outlook

  • Data Limitations: This study used simulated data and lacked crucial information such as visit timestamps and spending amounts.
  • Future Outlook: Introducing Sequential Pattern Mining or Markov Chain models to analyze the order of itinerary items would enable more refined travel route planning advice (a minimal sketch follows this list).
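As a pointer in that direction, the sketch below estimates a first-order Markov transition matrix in base R; the ordered itineraries are purely hypothetical, since the current dataset carries no visit timestamps:

# Hypothetical ordered itineraries (the real data would need visit timestamps)
sequences <- list(
  c("Museum", "Historic Building", "Specialty Restaurant"),
  c("Museum", "Art Gallery", "Coffee Shop"),
  c("Historic Building", "Specialty Restaurant", "Bar/Nightclub")
)

# Collect all consecutive (from, to) visit pairs
pairs <- do.call(rbind, lapply(sequences, function(s) {
  if (length(s) < 2) return(NULL)
  data.frame(from = head(s, -1), to = tail(s, -1))
}))

# First-order transition matrix: P(next item | current item)
transition_counts <- table(pairs$from, pairs$to)
transition_matrix <- prop.table(transition_counts, margin = 1)
print(round(transition_matrix, 2))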

6. References

6.1 Core Algorithms and Theoretical Foundations

  • Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB).
  • MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability.

6.2 Software and R Package Citations

  • R Core Team (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
  • Hahsler, M., Buchta, C., Gruen, B., & Hornik, K. (2024). arules: Mining Association Rules and Frequent Itemsets. R package version X.X.X.
  • Kassambara, A., & Mundt, F. (2024). factoextra: Extract and Visualize the Results of Multivariate Data Analyses. R package version X.X.X.