This project uses the Amazon Cell Phones Reviews dataset, which contains large-scale customer review data collected from Amazon product listings in the mobile phone and smartphone category. The dataset combines product-level metadata with review-level textual and rating information, enabling joint analysis of customer sentiment, product attributes, and behavioral patterns.
The dataset is provided as two main relational tables:
items.csv: This file contains one row per product and provides structured information describing each mobile device. Key attributes include:
asin – unique product identifier used for merging across tables
brand – manufacturer name
title – product name and description
price – listed selling price
originalPrice – reference price (often missing or zero)
rating – average product rating
totalReviews – total number of reviews per product
url and image – product listing references
These attributes provide contextual information about pricing, brand positioning, and overall product popularity.
reviews.csv: This file contains individual customer review records linked to
products via the asin identifier. Each row corresponds to a
single review instance and includes:
rating – individual star rating (1–5 scale)
title – short review headline
body – full review text
date – review timestamp
verified – verified purchase indicator
helpfulVotes – user feedback on review usefulness
This table captures fine-grained customer opinions and textual feedback that reflect real usage experiences and sentiment expression.
For analysis, the two tables are merged using the asin
product identifier, resulting in a review-level dataset where each
observation contains:
Product characteristics (brand, price, popularity)
Individual rating outcomes
Free-text customer feedback
After preprocessing and filtering invalid price entries, the final dataset contains more than 56,000 review transactions, making it suitable for large-scale association rule mining and pattern discovery.
The dataset is well suited for market basket–style analysis because it combines:
categorical product attributes (brand, price tiers)
discretized rating outcomes (poor to excellent)
binary review theme indicators extracted from text (quality mentions, issue reports, battery discussion, design feedback, etc.)
This structure enables the discovery of co-occurrence relationships between product characteristics, review content themes, and customer satisfaction outcomes, supporting both behavioral interpretation and applied recommendation insights.
#install.packages("arules")
#install.packages("arulesViz")
library(arules)
library(arulesViz)
In this step, the two raw files are loaded and merged using the
product identifier (asin) to create a review-level dataset
enriched with product metadata. This merge produces a single table where
each row corresponds to one review, while product attributes (brand,
price, product-level rating, etc.) are repeated for all reviews of the
same item.
A basic quality check is then performed on price, which
often contains missing values or placeholder zeros in scraped datasets.
Because price is later discretized into tiers (budget → luxury) and
treated as a categorical item, reviews with price ≤ 0 are
removed to avoid introducing invalid price categories that would weaken
interpretability and distort rule frequencies.
# Load your data
reviews <- read.csv("C:\\Users\\mevin\\Downloads\\USL\\20191226-reviews.csv", header=TRUE)
items <- read.csv("C:\\Users\\mevin\\Downloads\\USL\\20191226-items.csv", header=TRUE)
# Merge
merged_data <- merge(reviews, items, by="asin", all.x=TRUE)
summary(merged_data$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 104.0 180.0 222.1 300.6 1000.0
sum(merged_data$price == 0, na.rm = TRUE)
## [1] 11755
mean(merged_data$price == 0, na.rm = TRUE) * 100
## [1] 17.29032
merged_data <- merged_data[merged_data$price > 0, ]
Before feature engineering, it is necessary to confirm the structure and content of the merged dataset.
str(merged_data)
## 'data.frame': 56231 obs. of 17 variables:
## $ asin : chr "B0009N5L7K" "B0009N5L7K" "B0009N5L7K" "B0009N5L7K" ...
## $ name : chr "Marcel Thomas" "William B." "K. Mcilhargey" "Stephen Cahill" ...
## $ rating.x : int 1 4 5 1 5 1 5 4 1 1 ...
## $ date : chr "March 5, 2016" "February 9, 2006" "February 7, 2006" "December 20, 2016" ...
## $ verified : chr "true" "false" "false" "true" ...
## $ title.x : chr "Stupid phone" "Exellent Service" "I love it" "Phones locked" ...
## $ body : chr "DON'T BUY OUT OF SERVICE" "I have been with nextel for nearly a year now I started out this time last year with the Motorola i205 and just"| __truncated__ "I just got it and have to say its easy to use, i can hear the person talking just fine and i have had no proble"| __truncated__ "1 star because the phones locked so I have to pay additional fees to unlock it" ...
## $ helpfulVotes : int NA NA NA NA NA NA NA NA NA NA ...
## $ brand : chr "Motorola" "Motorola" "Motorola" "Motorola" ...
## $ title.y : chr "Motorola I265 phone" "Motorola I265 phone" "Motorola I265 phone" "Motorola I265 phone" ...
## $ url : chr "https://www.amazon.com/Motorola-i265-I265-phone/dp/B0009N5L7K" "https://www.amazon.com/Motorola-i265-I265-phone/dp/B0009N5L7K" "https://www.amazon.com/Motorola-i265-I265-phone/dp/B0009N5L7K" "https://www.amazon.com/Motorola-i265-I265-phone/dp/B0009N5L7K" ...
## $ image : chr "https://m.media-amazon.com/images/I/419WBAVDARL._AC_UY218_ML3_.jpg" "https://m.media-amazon.com/images/I/419WBAVDARL._AC_UY218_ML3_.jpg" "https://m.media-amazon.com/images/I/419WBAVDARL._AC_UY218_ML3_.jpg" "https://m.media-amazon.com/images/I/419WBAVDARL._AC_UY218_ML3_.jpg" ...
## $ rating.y : num 3 3 3 3 3 3 3 2.7 2.7 2.7 ...
## $ reviewUrl : chr "https://www.amazon.com/product-reviews/B0009N5L7K" "https://www.amazon.com/product-reviews/B0009N5L7K" "https://www.amazon.com/product-reviews/B0009N5L7K" "https://www.amazon.com/product-reviews/B0009N5L7K" ...
## $ totalReviews : int 7 7 7 7 7 7 7 22 22 22 ...
## $ price : num 50 50 50 50 50 ...
## $ originalPrice: num 0 0 0 0 0 0 0 0 0 0 ...
head(merged_data)
The merged dataset contains 56,231 observations and 17 variables,
confirming that the join on asin produced a review-level
table suitable for basket construction.
The columns fall into two groups:
Review-level fields (e.g., rating.x,
title.x, body, verified,
helpfulVotes)
Product-level metadata (e.g., brand,
price, totalReviews,
rating.y)
Two rating fields are present: rating.x is the
reviewer’s star rating and is used for rating category construction;
rating.y represents a product-level aggregate rating and is
not used as the target in this project.
The verified field appears as text rather than logical
values; this is acceptable and can be converted later if “verified
purchase” is included as an item. originalPrice contains
many zeros and is treated as unreliable for tiering; price
is used for pricing features after filtering invalid values.
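As an optional, hedged cleanup step (not part of the original pipeline), the duplicated rating columns can be given clearer names and verified can be converted to a logical flag; the new column names below are illustrative only:
# Optional sketch: clearer aliases for the merged columns (assumed, not in the
# original workflow); downstream code continues to use rating.x / rating.y.
merged_data$review_rating  <- merged_data$rating.x             # reviewer's star rating
merged_data$product_rating <- merged_data$rating.y             # product-level average
merged_data$verified_flag  <- tolower(merged_data$verified) == "true"
table(merged_data$verified_flag)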
Association rule mining requires observations to be represented as categorical items rather than continuous values. For that reason, numeric variables are transformed into interpretable categories:
The review star rating is discretized into ordered outcome tiers (poor → excellent).
Price is binned into tier categories (budget → luxury).
Brand is treated as a categorical identifier; missing brands are
mapped to no_brand to prevent row loss during transaction
creation.
These engineered variables provide standardized symbolic items that can be combined with text-derived indicators in the basket representation.
# Product & Price Characteristics
merged_data$rating_category <- cut(merged_data$rating.x,
breaks=c(0,2,3,4,5),
labels=c("poor", "below_avg", "good", "excellent"),
include.lowest=TRUE)
merged_data$price_category <- cut(merged_data$price,
breaks=c(0,50,150,300,1000, Inf),
labels=c("budget", "mid_range", "premium", "ultra", "luxury"),
include.lowest=TRUE)
merged_data$rating_category <- as.character(merged_data$rating_category)
merged_data$price_category <- as.character(merged_data$price_category)
# Add brand (handle missing values)
merged_data$brand_cat <- ifelse(is.na(merged_data$brand) | merged_data$brand=="",
"no_brand", merged_data$brand)
In addition to product metadata, review text contains qualitative information about user experience. Instead of applying opaque language models, this notebook uses a transparent feature extraction approach: keyword-based binary flags that indicate whether a review discusses specific themes.
The extracted themes capture recurring review content such as:
Perceived quality/praise language
Reported issues or defects
Battery/charging performance
Price/value perception
Design/build usability cues
Connectivity/reception concerns
Screen/display descriptions
A common challenge in keyword-based extraction is false positives caused by negation (e.g., “no issues”). To reduce this, common negated-problem phrases are removed before issue keyword detection. These text-derived indicators are later treated as items in each review transaction, enabling rules that link review themes to rating outcomes and product contexts.
# Handle missing/empty review bodies
merged_data$body <- ifelse(is.na(merged_data$body), "", merged_data$body)
# Quality mentions
merged_data$has_quality <- grepl("quality|great|excellent|good|perfect|love|amazing|awesome",
tolower(merged_data$body), ignore.case=TRUE)
# Lowercase once
body_lc <- tolower(ifelse(is.na(merged_data$body), "", merged_data$body))
# Remove common negated-problem phrases (expand as needed)
body_issues <- gsub("\\b(no|not|without)\\s+(a\\s+)?(any\\s+)?(problem|problems|issue|issues)\\b",
"", body_lc, perl = TRUE)
# Problem mentions
merged_data$has_issues <- grepl("\\b(problem|problems|issue|issues|broke|broken|defect|bad|worst|terrible|hate|useless|waste)\\b",
body_issues, perl = TRUE)
# Battery/Power mentions
merged_data$has_battery <- grepl("battery|charge|power|charging|dies|drain",
tolower(merged_data$body), ignore.case=TRUE)
# Price/Value mentions
merged_data$has_price <- grepl("price|cheap|expensive|affordable|cost|value|worth|overpriced",
tolower(merged_data$body), ignore.case=TRUE)
# Design/Build mentions
merged_data$has_design <- grepl("design|look|style|feel|button|size|small|light|heavy",
tolower(merged_data$body), ignore.case=TRUE)
# Reception/Connectivity mentions
merged_data$has_signal <- grepl("reception|signal|network|wifi|connection|connectivity",
tolower(merged_data$body), ignore.case=TRUE)
# Screen quality mentions
merged_data$has_screen <- grepl("screen|display|bright|clear|resolution",
tolower(merged_data$body), ignore.case=TRUE)
head(merged_data[, c("rating_category", "brand_cat", "has_quality", "has_issues",
"has_battery", "has_price", "has_design", "has_signal")])
The preview confirms that the transformations were applied successfully: ratings and prices are mapped into categorical tiers, brand values are standardized, and text-based flags activate in realistic combinations (e.g., poor reviews triggering issue-related signals; excellent reviews triggering quality praise). This produces a compact set of interpretable categorical and binary variables that can be directly converted into basket-style transactions for association rule mining.
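As an optional diagnostic (a sketch, not part of the original notebook), the engineered categories and the activation rates of the text-derived flags can be tabulated before basket construction:
# Optional sanity checks (assumed): category distributions and flag activation rates
table(merged_data$rating_category)
table(merged_data$price_category)
colMeans(merged_data[, c("has_quality", "has_issues", "has_battery", "has_price",
                         "has_design", "has_signal", "has_screen")])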
Each review is converted into a transaction that contains:
Brand_*, Price_*, and
Rating_* categorical items, and
Theme items (e.g., mentions_quality,
has_issues) only when the corresponding flag is
TRUE.
transaction_list <- lapply(1:nrow(merged_data), function(i) {
items <- c()
# Add product characteristics
items <- c(items,
paste0("Brand_", merged_data$brand_cat[i]),
paste0("Rating_", merged_data$rating_category[i]),
paste0("Price_", merged_data$price_category[i])
)
# Add content features (only if TRUE)
if(merged_data$has_quality[i]) items <- c(items, "mentions_quality")
if(merged_data$has_issues[i]) items <- c(items, "has_issues")
if(merged_data$has_battery[i]) items <- c(items, "discusses_battery")
if(merged_data$has_price[i]) items <- c(items, "comments_price")
if(merged_data$has_design[i]) items <- c(items, "discusses_design")
if(merged_data$has_signal[i]) items <- c(items, "discusses_signal")
if(merged_data$has_screen[i]) items <- c(items, "discusses_screen")
return(items)
})
# Convert to transactions object
transactions <- as(transaction_list, "transactions")
inspect(head(transactions, 10))
## items
## [1] {Brand_Motorola,
## Price_budget,
## Rating_poor}
## [2] {Brand_Motorola,
## Price_budget,
## Rating_good}
## [3] {Brand_Motorola,
## Price_budget,
## Rating_excellent}
## [4] {Brand_Motorola,
## Price_budget,
## Rating_poor}
## [5] {Brand_Motorola,
## mentions_quality,
## Price_budget,
## Rating_excellent}
## [6] {Brand_Motorola,
## discusses_design,
## discusses_signal,
## has_issues,
## mentions_quality,
## Price_budget,
## Rating_poor}
## [7] {Brand_Motorola,
## mentions_quality,
## Price_budget,
## Rating_excellent}
## [8] {Brand_Motorola,
## comments_price,
## discusses_battery,
## discusses_design,
## discusses_screen,
## discusses_signal,
## mentions_quality,
## Price_mid_range,
## Rating_good}
## [9] {Brand_Motorola,
## Price_mid_range,
## Rating_poor}
## [10] {Brand_Motorola,
## discusses_battery,
## Price_mid_range,
## Rating_poor}
This representation satisfies the requirements of association mining:
items are categorical, each transaction is a coherent observation (one
review), and both product context and expressed themes are encoded
together. The resulting transactions object forms the input
for Apriori and ECLAT.
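Before mining, the transaction object itself can be inspected. The short sketch below (assumed; not part of the original notebook) summarizes basket sizes and shows the most frequent items using standard arules helpers:
# Optional exploration of the transactions object (a sketch, not in the original)
summary(transactions)                      # item count, density, basket-size distribution
itemFrequencyPlot(transactions, topN = 15,
                  main = "Top 15 Most Frequent Items")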
The Apriori algorithm mines frequent itemsets and generates association rules that satisfy minimum thresholds for support and confidence.
To avoid overly sparse or noisy rule sets, moderate thresholds are used and later refined through redundancy removal and quality filtering.
After rule generation, rules are ranked by three complementary quality measures: support, confidence, and lift.
These rankings provide multiple perspectives on rule importance and interpretability.
# Apriori
rules_apriori <- apriori(transactions,
parameter = list(support = 0.05,
confidence = 0.3,
minlen = 2,
maxlen = 5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.3 0.1 1 none FALSE TRUE 5 0.05 2
## maxlen target ext
## 5 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 2811
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[26 item(s), 56231 transaction(s)] done [0.01s].
## sorting and recoding items ... [21 item(s)] done [0.00s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [140 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
The Apriori algorithm was applied to the transactional dataset using the following parameter configuration: minimum support = 0.05, minimum confidence = 0.30, and a rule length (minlen to maxlen) of 2 to 5 items.
With 56,231 transactions in the dataset, the chosen minimum support of 5% corresponds to an absolute minimum frequency of 2,811 transactions. This ensures that every extracted rule represents a pattern that occurs in a substantial portion of the dataset, reducing the risk of noise-driven or extremely rare associations.
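As a quick arithmetic check (an assumed addition, not part of the original output), the reported absolute count can be reproduced from the relative threshold:
# The absolute minimum support count follows from the relative threshold and the
# number of transactions (sketch; not in the original code).
0.05 * length(transactions)   # = 2811.55; apriori() reports an absolute count of 2811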
Using the specified thresholds, Apriori generated a total of 140 association rules. This number reflects a manageable rule volume that is large enough to capture diverse behavioral patterns, yet small enough to allow structured filtering, ranking, and interpretation in subsequent analysis steps.
# Sort by different metrics
rules_by_lift <- sort(rules_apriori, by = "lift", decreasing = TRUE)
rules_by_conf <- sort(rules_apriori, by = "confidence", decreasing = TRUE)
rules_by_supp <- sort(rules_apriori, by = "support", decreasing = TRUE)
# Inspect top rules
inspect(head(rules_by_lift, 10))
## lhs rhs support confidence coverage lift count
## [1] {discusses_screen,
## mentions_quality} => {discusses_design} 0.06580000 0.5640244 0.11666163 3.015369 3700
## [2] {discusses_design,
## mentions_quality} => {discusses_screen} 0.06580000 0.4799585 0.13709520 2.862291 3700
## [3] {discusses_screen} => {discusses_design} 0.08104071 0.4832962 0.16768331 2.583783 4557
## [4] {discusses_design} => {discusses_screen} 0.08104071 0.4332573 0.18704985 2.583783 4557
## [5] {Brand_Nokia} => {Price_mid_range} 0.05705038 0.6078060 0.09386282 2.535050 3208
## [6] {discusses_battery,
## mentions_quality} => {discusses_screen} 0.06334584 0.4209904 0.15046860 2.510628 3562
## [7] {discusses_screen,
## mentions_quality} => {discusses_battery} 0.06334584 0.5429878 0.11666163 2.432113 3562
## [8] {discusses_battery,
## mentions_quality} => {discusses_design} 0.06741833 0.4480558 0.15046860 2.395382 3791
## [9] {Brand_Motorola} => {Price_mid_range} 0.06928563 0.5592077 0.12389963 2.332355 3896
## [10] {discusses_design,
## mentions_quality} => {discusses_battery} 0.06741833 0.4917629 0.13709520 2.202670 3791
inspect(head(rules_by_conf, 10))
## lhs rhs support confidence coverage lift count
## [1] {discusses_screen,
## Rating_excellent} => {mentions_quality} 0.06435952 0.8631052 0.07456741 1.590629 3619
## [2] {Brand_Xiaomi} => {Price_premium} 0.06562217 0.8490566 0.07728833 1.937005 3690
## [3] {comments_price,
## discusses_design} => {mentions_quality} 0.05105725 0.8469027 0.06028703 1.560769 2871
## [4] {comments_price,
## discusses_battery} => {mentions_quality} 0.05118173 0.8395566 0.06096281 1.547231 2878
## [5] {discusses_battery,
## Rating_excellent} => {mentions_quality} 0.08753179 0.8307173 0.10536892 1.530941 4922
## [6] {comments_price,
## Rating_excellent} => {mentions_quality} 0.08488200 0.8216561 0.10330601 1.514242 4773
## [7] {discusses_battery,
## discusses_design} => {mentions_quality} 0.06741833 0.8194985 0.08226779 1.510265 3791
## [8] {discusses_design,
## Rating_excellent} => {mentions_quality} 0.08057833 0.8192009 0.09836211 1.509717 4531
## [9] {discusses_design,
## discusses_screen} => {mentions_quality} 0.06580000 0.8119377 0.08104071 1.496332 3700
## [10] {discusses_battery,
## discusses_screen} => {mentions_quality} 0.06334584 0.8108354 0.07812417 1.494300 3562
inspect(head(rules_by_supp, 10))
## lhs rhs support confidence coverage
## [1] {Rating_excellent} => {mentions_quality} 0.3724458 0.6643510 0.5606160
## [2] {mentions_quality} => {Rating_excellent} 0.3724458 0.6863857 0.5426188
## [3] {Brand_Samsung} => {Rating_excellent} 0.2487774 0.5604792 0.4438655
## [4] {Rating_excellent} => {Brand_Samsung} 0.2487774 0.4437571 0.5606160
## [5] {Price_premium} => {Rating_excellent} 0.2476926 0.5650763 0.4383347
## [6] {Rating_excellent} => {Price_premium} 0.2476926 0.4418221 0.5606160
## [7] {Price_premium} => {mentions_quality} 0.2400989 0.5477524 0.4383347
## [8] {mentions_quality} => {Price_premium} 0.2400989 0.4424816 0.5426188
## [9] {Brand_Samsung} => {mentions_quality} 0.2305312 0.5193718 0.4438655
## [10] {mentions_quality} => {Brand_Samsung} 0.2305312 0.4248492 0.5426188
## lift count
## [1] 1.2243419 20943
## [2] 1.2243419 20943
## [3] 0.9997559 13989
## [4] 0.9997559 13989
## [5] 1.0079560 13928
## [6] 1.0079560 13928
## [7] 1.0094606 13501
## [8] 1.0094606 13501
## [9] 0.9571576 12963
## [10] 0.9571576 12963
Rules ranked by lift highlight associations that are much stronger than chance. For example:
{discusses_screen, mentions_quality} → {discusses_design} (lift ≈ 3.02)
{discusses_design, mentions_quality} → {discusses_screen} (lift ≈ 2.86)
Lift values approaching 3 indicate that the consequent theme appears roughly three times as often as expected under independence when those antecedents occur. This suggests that hardware-related themes are discussed as a bundle rather than as isolated topics.
Rules ranked by confidence emphasize how consistently the RHS appears when the LHS occurs. Examples include:
{discusses_screen, Rating_excellent} → {mentions_quality} (confidence ≈ 0.86)
{comments_price, discusses_design} → {mentions_quality} (confidence ≈ 0.85)
{discusses_battery, Rating_excellent} → {mentions_quality} (confidence ≈ 0.83)
Confidence values above 0.80 indicate that once these antecedent combinations appear, quality-praise language is very likely to co-occur in the same review.
Rules ranked by support show the most frequent co-occurrences in the dataset, such as:
{Rating_excellent} → {mentions_quality} (support ≈ 0.37)
{mentions_quality} → {Rating_excellent} (support ≈ 0.37)
{Price_premium} → {Rating_excellent}
{Brand_Samsung} → {Rating_excellent}
These rules often have lift values closer to 1 because they reflect dominant global trends, but they remain important because they describe patterns that appear in a large portion of reviews.
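To separate genuinely informative frequent rules from pure base-rate effects, an optional follow-up filter (a sketch, not in the original code) keeps only the high-support rules whose lift clears a modest threshold:
# Optional: among the most frequent rules, keep only those adding lift beyond the
# global base rates (assumed follow-up, not in the original).
inspect(head(subset(rules_by_supp, subset = lift > 1.2), 5))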
While Apriori directly generates association rules, ECLAT focuses on efficiently mining frequent itemsets using a vertical data representation. This approach can improve computational efficiency and provides an alternative pathway for identifying frequent co-occurrence structures prior to rule induction.
In this phase:
frequent itemsets are mined with eclat() using the same minimum support threshold,
association rules are then derived from those itemsets with ruleInduction(), and
the induced rules are evaluated using confidence and lift.
# ECLAT
eclat_itemsets <- eclat(transactions,
parameter = list(support = 0.05,
maxlen = 5))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.05 1 5 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 2811
##
## create itemset ...
## set transactions ...[26 item(s), 56231 transaction(s)] done [0.01s].
## sorting and recoding items ... [21 item(s)] done [0.00s].
## creating bit matrix ... [21 row(s), 56231 column(s)] done [0.00s].
## writing ... [107 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
# View frequent itemsets
inspect(head(sort(eclat_itemsets, by = "support", decreasing = TRUE), 10))
## items support count
## [1] {Rating_excellent} 0.5606160 31524
## [2] {mentions_quality} 0.5426188 30512
## [3] {Brand_Samsung} 0.4438655 24959
## [4] {Price_premium} 0.4383347 24648
## [5] {mentions_quality, Rating_excellent} 0.3724458 20943
## [6] {Price_ultra} 0.3138838 17650
## [7] {Brand_Samsung, Rating_excellent} 0.2487774 13989
## [8] {Price_premium, Rating_excellent} 0.2476926 13928
## [9] {Rating_poor} 0.2416461 13588
## [10] {mentions_quality, Price_premium} 0.2400989 13501
# Convert to rules from ECLAT itemsets
rules_eclat <- ruleInduction(eclat_itemsets, transactions, confidence = 0.3)
# Sort ECLAT rules
rules_eclat_lift <- sort(rules_eclat, by = "lift", decreasing = TRUE)
inspect(head(rules_eclat_lift, 10))
## lhs rhs support confidence lift itemset
## [1] {discusses_screen,
## mentions_quality} => {discusses_design} 0.06580000 0.5640244 3.015369 41
## [2] {discusses_design,
## mentions_quality} => {discusses_screen} 0.06580000 0.4799585 2.862291 41
## [3] {discusses_screen} => {discusses_design} 0.08104071 0.4832962 2.583783 51
## [4] {discusses_design} => {discusses_screen} 0.08104071 0.4332573 2.583783 51
## [5] {Brand_Nokia} => {Price_mid_range} 0.05705038 0.6078060 2.535050 5
## [6] {discusses_battery,
## mentions_quality} => {discusses_screen} 0.06334584 0.4209904 2.510628 42
## [7] {discusses_screen,
## mentions_quality} => {discusses_battery} 0.06334584 0.5429878 2.432113 42
## [8] {discusses_battery,
## mentions_quality} => {discusses_design} 0.06741833 0.4480558 2.395382 52
## [9] {Brand_Motorola} => {Price_mid_range} 0.06928563 0.5592077 2.332355 8
## [10] {discusses_design,
## mentions_quality} => {discusses_battery} 0.06741833 0.4917629 2.202670 52
The most frequent itemsets exhibit high support values, reflecting
strong representation of dominant rating and quality-related
combinations. For example, combined itemsets such as
{mentions_quality, Rating_excellent} show substantial joint
frequency, reinforcing the strong relationship between positive
sentiment language and high ratings.
These frequent itemsets provide a baseline view of dominant co-occurrence structures in the dataset prior to directional rule induction.
Association rules were generated from the frequent itemsets using a minimum confidence threshold of 0.30, and then ranked by lift to identify the strongest non-random relationships.
The top-ranked ECLAT rules closely mirror the highest-lift rules obtained using Apriori. For example:
{discusses_screen, mentions_quality} → {discusses_design}
{discusses_design, mentions_quality} → {discusses_screen}
{Brand_Nokia} → {Price_mid_range}
All of these rules achieve lift values between roughly 2.5 and 3.0, indicating strong topic and product-context co-occurrence effects.
This overlap demonstrates that the discovered associations are algorithm-independent, meaning they are not artifacts of a specific mining strategy but reflect genuine structural patterns in the review data. In fact, as the comparison below shows, rule induction from the ECLAT itemsets yields the same set of 140 unique rules as Apriori, confirming that the strong linkage between screen, design, battery, and quality discussion is stable across mining strategies.
Because different mining strategies can yield partially overlapping outputs, the rule sets produced by Apriori and ECLAT are explicitly compared and consolidated to ensure analytical consistency. The consolidation procedure consists of:
counting the rules produced by each algorithm,
concatenating the two rule sets, and
removing duplicate rules so that each unique rule is retained only once.
This combined set is used for downstream filtering and interpretation.
# Apriori vs ECLAT
cat("APRIORI Rules found:", length(rules_apriori), "\n")
## APRIORI Rules found: 140
cat("ECLAT Rules found:", length(rules_eclat), "\n")
## ECLAT Rules found: 140
# Combine both rule sets (remove duplicates)
all_rules <- c(rules_apriori, rules_eclat)
all_rules_unique <- all_rules[!duplicated(all_rules)]
cat("Combined unique rules:", length(all_rules_unique), "\n")
## Combined unique rules: 140
The fact that the combined unique rule count remains 140 indicates that both algorithms produced identical rule sets under the selected parameter configuration.
This outcome provides strong methodological validation. Despite using fundamentally different mining strategies (candidate generation in Apriori versus vertical intersections in ECLAT), both methods converged on the same association structure. This confirms that the extracted patterns are stable, algorithm-independent, and suitable for downstream filtering and interpretation.
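A hedged cross-check of this claim (not in the original code) compares the two rule sets directly by their labels using base R set operations:
# Explicit comparison of the two rule sets via their labels (assumed check)
length(intersect(labels(rules_apriori), labels(rules_eclat)))   # expected: 140
setdiff(labels(rules_apriori), labels(rules_eclat))             # expected: character(0)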
Raw association rule outputs typically include redundant, weak, and low-informational rules. A post-mining filtering step is therefore applied to improve interpretability and analytical focus.
The filtering procedure consists of:
removing logically redundant rules with is.redundant(), and
retaining only rules with lift > 1 and confidence > 0.4.
The resulting subset forms the core rule set used for downstream interpretation and application-oriented analysis.
# Remove redundant rules
rules_clean <- all_rules_unique[!is.redundant(all_rules_unique)]
cat("Rules after removing redundancy:", length(rules_clean), "\n")
## Rules after removing redundancy: 114
# Filter for interesting rules
rules_interesting <- subset(rules_clean,
subset = lift > 1 & confidence > 0.4)
inspect(head(sort(rules_interesting, by = "lift", decreasing = TRUE), 10))
## lhs rhs support confidence coverage lift count itemset
## [1] {discusses_screen,
## mentions_quality} => {discusses_design} 0.06580000 0.5640244 0.11666163 3.015369 3700 NA
## [2] {discusses_design,
## mentions_quality} => {discusses_screen} 0.06580000 0.4799585 0.13709520 2.862291 3700 NA
## [3] {discusses_screen} => {discusses_design} 0.08104071 0.4832962 0.16768331 2.583783 4557 NA
## [4] {discusses_design} => {discusses_screen} 0.08104071 0.4332573 0.18704985 2.583783 4557 NA
## [5] {Brand_Nokia} => {Price_mid_range} 0.05705038 0.6078060 0.09386282 2.535050 3208 NA
## [6] {discusses_battery,
## mentions_quality} => {discusses_screen} 0.06334584 0.4209904 0.15046860 2.510628 3562 NA
## [7] {discusses_screen,
## mentions_quality} => {discusses_battery} 0.06334584 0.5429878 0.11666163 2.432113 3562 NA
## [8] {discusses_battery,
## mentions_quality} => {discusses_design} 0.06741833 0.4480558 0.15046860 2.395382 3791 NA
## [9] {Brand_Motorola} => {Price_mid_range} 0.06928563 0.5592077 0.12389963 2.332355 3896 NA
## [10] {discusses_design,
## mentions_quality} => {discusses_battery} 0.06741833 0.4917629 0.13709520 2.202670 3791 NA
The combined rule set initially contained 140 rules. After applying logical redundancy removal, 114 non-redundant rules remained. This step preserves the association structure while eliminating repetitive rule representations.
After the additional quality-based filtering (lift > 1 and confidence > 0.4), the final rule set contains 68 high-quality association rules. This subset represents the most statistically meaningful and behaviorally interpretable patterns extracted from the original dataset.
To support focused interpretation, targeted research questions are defined and corresponding subsets of association rules are extracted for detailed analysis.
What predicts excellent ratings?
Rules with RHS = Rating_excellent highlight what tends to
co-occur with top ratings.
What predicts poor ratings?
Rules with RHS = Rating_poor highlight patterns linked to
dissatisfaction.
When reviews mention quality, what else is
associated?
Rules with LHS = mentions_quality reveal what other themes
co-occur with quality statements.
Each subset is ranked by lift or confidence to prioritize the most informative associations.
# What predicts excellent ratings?
rules_excellent <- subset(rules_interesting,
subset = rhs %in% "Rating_excellent")
rules_excellent_sorted <- sort(rules_excellent, by = "lift", decreasing = TRUE)
inspect(head(rules_excellent_sorted, 10))
## lhs rhs support confidence coverage lift count itemset
## [1] {Brand_Samsung,
## mentions_quality,
## Price_ultra} => {Rating_excellent} 0.07558109 0.7457449 0.10134979 1.330224 4250 NA
## [2] {mentions_quality,
## Price_ultra} => {Rating_excellent} 0.12464655 0.7255694 0.17179136 1.294236 7009 NA
## [3] {Brand_Samsung,
## mentions_quality} => {Rating_excellent} 0.16487347 0.7151894 0.23053120 1.275721 9271 NA
## [4] {Brand_Xiaomi} => {Rating_excellent} 0.05461400 0.7066268 0.07728833 1.260447 3071 NA
## [5] {mentions_quality} => {Rating_excellent} 0.37244580 0.6863857 0.54261884 1.224342 20943 NA
## [6] {Brand_Samsung,
## Price_ultra} => {Rating_excellent} 0.11525671 0.6071763 0.18982412 1.083052 6481 NA
## [7] {comments_price} => {Rating_excellent} 0.10330601 0.6068742 0.17022639 1.082513 5809 NA
## [8] {Price_ultra} => {Rating_excellent} 0.18920169 0.6027762 0.31388380 1.075203 10639 NA
## [9] {Price_premium} => {Rating_excellent} 0.24769255 0.5650763 0.43833473 1.007956 13928 NA
# What predicts poor ratings?
rules_poor <- subset(rules_interesting,
subset = rhs %in% "Rating_poor")
rules_poor_sorted <- sort(rules_poor, by = "lift", decreasing = TRUE)
inspect(head(rules_poor_sorted, 10))
## lhs rhs support confidence coverage lift count
## [1] {has_issues} => {Rating_poor} 0.077644 0.4684549 0.1657449 1.938599 4366
## itemset
## [1] NA
# When reviews mention quality, what else is associated?
rules_quality <- subset(rules_interesting,
subset = lhs %in% "mentions_quality")
rules_quality_sorted <- sort(rules_quality, by = "lift", decreasing = TRUE)
inspect(head(rules_quality_sorted, 10))
## lhs rhs support confidence coverage lift count itemset
## [1] {discusses_screen,
## mentions_quality} => {discusses_design} 0.06580000 0.5640244 0.1166616 3.015369 3700 NA
## [2] {discusses_design,
## mentions_quality} => {discusses_screen} 0.06580000 0.4799585 0.1370952 2.862291 3700 NA
## [3] {discusses_battery,
## mentions_quality} => {discusses_screen} 0.06334584 0.4209904 0.1504686 2.510628 3562 NA
## [4] {discusses_screen,
## mentions_quality} => {discusses_battery} 0.06334584 0.5429878 0.1166616 2.432113 3562 NA
## [5] {discusses_battery,
## mentions_quality} => {discusses_design} 0.06741833 0.4480558 0.1504686 2.395382 3791 NA
## [6] {discusses_design,
## mentions_quality} => {discusses_battery} 0.06741833 0.4917629 0.1370952 2.202670 3791 NA
## [7] {Brand_Samsung,
## mentions_quality} => {Price_ultra} 0.10134979 0.4396359 0.2305312 1.400633 5699 NA
## [8] {Brand_Samsung,
## mentions_quality,
## Price_ultra} => {Rating_excellent} 0.07558109 0.7457449 0.1013498 1.330224 4250 NA
## [9] {mentions_quality,
## Price_ultra} => {Rating_excellent} 0.12464655 0.7255694 0.1717914 1.294236 7009 NA
## [10] {Brand_Samsung,
## mentions_quality} => {Rating_excellent} 0.16487347 0.7151894 0.2305312 1.275721 9271 NA
The highest-lift rules indicate that quality-related sentiment is strongly coupled with screen performance and physical design discussion. For example, combinations of quality mentions with screen or battery discussion predict design-related discussion with lift values between roughly 2.4 and 3.0, while quality mentions combined with ultra-tier pricing or the Samsung brand predict excellent ratings with confidence values above 70%.
This demonstrates that positive quality evaluations are rarely isolated and instead occur within multi-attribute hardware evaluation patterns.
Rather than expressing quality as an abstract judgment, reviewers tend to ground quality perception in tangible attributes such as display performance, physical build, ergonomics, and battery reliability. Quality language therefore functions as an integrative signal that activates multi-feature evaluation behavior.
To characterize the overall quality structure of the filtered rule set, summary statistics and pairwise correlations between support, confidence, and lift are computed.
A correlation matrix is additionally computed to examine trade-offs between rule frequency, association strength, and predictive reliability.
# Summary statistics
summary(rules_interesting)
## set of 68 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 35 31 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.515 3.000 4.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.05010 Min. :0.4144 Min. :0.06029 Min. :1.008
## 1st Qu.:0.06531 1st Qu.:0.4628 1st Qu.:0.10148 1st Qu.:1.072
## Median :0.07820 Median :0.5914 Median :0.15811 Median :1.321
## Mean :0.10507 Mean :0.6034 Mean :0.18436 Mean :1.486
## 3rd Qu.:0.11526 3rd Qu.:0.7128 3rd Qu.:0.18936 3rd Qu.:1.551
## Max. :0.37245 Max. :0.8631 Max. :0.56062 Max. :3.015
##
## count itemset
## Min. : 2817 Min. : NA
## 1st Qu.: 3672 1st Qu.: NA
## Median : 4397 Median : NA
## Mean : 5908 Mean :NaN
## 3rd Qu.: 6481 3rd Qu.: NA
## Max. :20943 Max. : NA
## NA's :68
##
## mining info:
## data ntransactions support confidence
## transactions 56231 0.05 0.3
# Correlation between support, confidence, lift
cor_matrix <- cor(cbind(
support = quality(rules_interesting)$support,
confidence = quality(rules_interesting)$confidence,
lift = quality(rules_interesting)$lift
))
print("Correlation Matrix:")
## [1] "Correlation Matrix:"
print(cor_matrix)
## support confidence lift
## support 1.00000000 -0.07373331 -0.3223551
## confidence -0.07373331 1.00000000 -0.1034506
## lift -0.32235507 -0.10345057 1.0000000
The distribution of rule lengths (LHS + RHS) indicates moderate structural complexity: 35 rules contain two items, 31 contain three items, and only 2 contain four items.
The median rule length is 2 items (mean ≈ 2.52). This distribution balances interpretability with expressive power, as most rules remain compact while still capturing meaningful interaction patterns. The absence of long itemsets confirms that combinatorial explosion was effectively controlled.
The support values range from approximately 5.0% to 37.2%, with a median of about 7.8% and a mean of about 10.5%.
This indicates that most retained rules occur in several thousand transactions, ensuring that extracted associations reflect stable behavioral patterns rather than rare or noisy events.
Confidence values range from about 0.41 to 0.86, with a median of roughly 0.59 and a mean of 0.60.
These values demonstrate solid conditional reliability, meaning that the majority of retained rules produce the right-hand outcome in most occurrences of the left-hand condition.
Lift values range from approximately 1.01 to 3.02, with a median of about 1.32 and a mean of roughly 1.49.
The presence of multiple rules with lift values above 2 confirms the existence of strong non-random associations, particularly among hardware-related discussion themes such as screen, design, and battery.
The correlation structure highlights the trade-offs between rule quality metrics: support and lift are moderately negatively correlated (≈ -0.32), while confidence is only weakly correlated with support (≈ -0.07) and lift (≈ -0.10). In other words, the most frequent rules tend not to be the most surprising ones, and high confidence can occur at any support level.
The metric distributions confirm a balanced rule structure: frequent rules capture dominant behavioral trends, while high-lift rules reveal strong but more specialized co-occurrence patterns. The combined presence of stable support, high confidence, and meaningful lift values indicates that the filtered rule set is well suited for downstream visualization, interpretation, and applied analysis.
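These trade-offs can also be inspected visually; the following optional snippet (assumed, not part of the original notebook) plots the pairwise relationships between the three quality measures:
# Optional visual check of the metric trade-offs (sketch, not in the original)
pairs(quality(rules_interesting)[, c("support", "confidence", "lift")],
      main = "Pairwise Relationships Between Rule Quality Measures")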
To assess the structural complexity of the discovered association rules, the number of items appearing on the left-hand side (LHS; antecedent) is analyzed. LHS size provides a direct proxy for interpretability: shorter antecedents are easier to communicate and operationalize, while longer antecedents capture more specific behavioral contexts.
# How many items in each rule?
lhs_sizes <- size(lhs(rules_interesting))
# number of items on LHS
rhs_sizes <- size(rhs(rules_interesting))
# number of items on RHS (usually 1)
rule_sizes <- size(lhs(rules_interesting)) + size(rhs(rules_interesting))
# Count frequency of LHS sizes
lhs_size_table <- as.data.frame(table(lhs_sizes))
colnames(lhs_size_table) <- c("LHS_Size", "Count")
lhs_size_table
barplot(table(lhs_sizes),
main = "Distribution of LHS Itemset Size",
xlab = "LHS size",
ylab = "Count")
The frequency distribution of LHS itemset sizes is as follows:
Single-item antecedents (size = 1): 35 rules
Two-item antecedents (size = 2): 31 rules
Three-item antecedents (size = 3): 2 rules
This distribution shows that the majority of high-quality rules involve one or two antecedent conditions, with very few complex multi-condition patterns.
The large share of two-item antecedents indicates that many meaningful associations arise from interactions between pairs of attributes, such as combinations of feature mentions (for example, screen and design) or price tier with quality-related discussion. These mid-complexity rules provide a strong balance between interpretability and predictive strength.
Single-item antecedents remain important for capturing strong direct relationships, such as the link between issue reporting and poor ratings or quality mentions and excellent ratings. These rules are particularly valuable for monitoring and deployment because they are simple, stable, and easy to translate into operational triggers.
Three-item antecedents represent higher-order interaction patterns that capture more specific behavioral contexts. Although rare in this rule set, they can highlight niche but informative customer segments, such as brand, price-tier, and quality-mention combinations linked to excellent ratings.
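As a concrete follow-up to the single-condition rules highlighted above, the short sketch below (not part of the original code) extracts them directly so they can be reviewed as candidate monitoring triggers:
# Pull out the simple single-condition rules (assumed follow-up, not in the original)
single_lhs_rules <- rules_interesting[size(lhs(rules_interesting)) == 1]
inspect(head(sort(single_lhs_rules, by = "lift", decreasing = TRUE), 5))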
Because association rule outputs can be large, visualization is essential for understanding global structure.
We use multiple complementary plots from arulesViz:
scatter plots of support vs confidence and support vs lift,
a matrix view of antecedent-consequent combinations,
a grouped plot that clusters rules by shared antecedents,
a network graph of item connections, and
a parallel coordinates plot of multi-item rule pathways.
These visuals are interpreted together to identify consistent, high-value patterns rather than relying on a single ranked list.
# Support vs Confidence colored by Lift
plot(rules_interesting,
main = "Support vs Confidence (colored by Lift)",
measure = c("support", "confidence"),
shading = "lift")
# Support vs Lift
plot(rules_interesting,
main = "Support vs Lift",
measure = c("support", "lift"),
shading = "confidence")
Two scatter plots were used to examine trade-offs between rule frequency, reliability, and association strength: support vs confidence (colored by lift) and support vs lift (colored by confidence).
The visual patterns show that most rules are concentrated in the low-to-moderate support range (approximately 5%-12%), indicating that strong associations tend to occur within specific subsets of reviews rather than dominating the entire dataset. Confidence values span a wide range (≈ 0.41 to 0.86), demonstrating substantial variation in predictive reliability across rule types.
High-lift rules are primarily located at lower support values, confirming that highly “surprising” associations are less frequent but structurally strong. In contrast, rules with very high support (above 25%) typically exhibit lift values close to 1, reflecting dominant global trends rather than strong conditional relationships.
At the same time, high-confidence rules appear across both moderate and high lift regions, indicating that association strength and predictive reliability are not mutually exclusive.
Overall, the scatter visualizations confirm a balanced rule structure in which:
high-support rules capture dominant global trends with lift close to 1,
high-lift rules occur at lower support and describe strong, more specialized associations, and
high-confidence rules appear across both regions.
This balance supports the use of the filtered rule set for downstream interpretation and applied analysis.
# Matrix visualization
plot(rules_interesting,
method = "matrix",
main = "Matrix View")
## Itemsets in Antecedent (LHS)
## [1] "{discusses_screen,mentions_quality}"
## [2] "{discusses_design,mentions_quality}"
## [3] "{discusses_battery,mentions_quality}"
## [4] "{Brand_Nokia}"
## [5] "{discusses_screen}"
## [6] "{discusses_design}"
## [7] "{Brand_Motorola}"
## [8] "{Brand_Xiaomi}"
## [9] "{discusses_screen,Rating_excellent}"
## [10] "{comments_price,discusses_design}"
## [11] "{comments_price,discusses_battery}"
## [12] "{discusses_battery,Rating_excellent}"
## [13] "{comments_price,Rating_excellent}"
## [14] "{discusses_battery,discusses_design}"
## [15] "{discusses_design,Rating_excellent}"
## [16] "{discusses_design,discusses_screen}"
## [17] "{discusses_battery,discusses_screen}"
## [18] "{Brand_Samsung,Rating_excellent}"
## [19] "{has_issues}"
## [20] "{comments_price,Price_premium}"
## [21] "{discusses_design,Price_premium}"
## [22] "{Price_ultra,Rating_excellent}"
## [23] "{Brand_Samsung}"
## [24] "{Brand_Samsung,mentions_quality}"
## [25] "{Brand_Samsung,mentions_quality,Price_ultra}"
## [26] "{discusses_screen,Price_premium}"
## [27] "{mentions_quality,Price_ultra}"
## [28] "{discusses_battery,Price_ultra}"
## [29] "{discusses_battery,Price_premium}"
## [30] "{Price_mid_range,Rating_excellent}"
## [31] "{Brand_Samsung,Price_premium,Rating_excellent}"
## [32] "{comments_price}"
## [33] "{Rating_good}"
## [34] "{Price_ultra}"
## [35] "{discusses_battery}"
## [36] "{mentions_quality}"
## [37] "{Rating_excellent}"
## [38] "{Brand_Samsung,Price_ultra}"
## [39] "{comments_price,mentions_quality}"
## [40] "{Rating_poor}"
## [41] "{Price_premium}"
## Itemsets in Consequent (RHS)
## [1] "{Price_premium}" "{Rating_excellent}" "{Brand_Samsung}"
## [4] "{mentions_quality}" "{Price_ultra}" "{Rating_poor}"
## [7] "{discusses_battery}" "{Price_mid_range}" "{discusses_screen}"
## [10] "{discusses_design}"
The matrix plot provides a compact overview of left-hand side (LHS) and right-hand side (RHS) item relationships across the 68 filtered rules.
High-lift rules (darker shading) are concentrated in specific LHS–RHS intersections, indicating that strong associations form localized structural clusters rather than being uniformly distributed. A small number of RHS outcomes dominate the matrix, particularly items related to screen discussion, design evaluation, and hardware-related attributes, confirming their central role in rule formation.
The presence of repeated vertical and horizontal bands reflects stable co-occurrence patterns across multiple antecedent combinations, reinforcing the structural robustness of these associations.
# Grouped plot - groups rules by antecedent
plot(rules_interesting,
     method = "grouped",
     main = "Grouped Rules by Antecedent")
## Available control parameters (with default values):
## k = 20
## aggr.fun = function (x, ...) UseMethod("mean")
## rhs_max = 10
## lhs_label_items = 2
## col = c("#EE0000FF", "#EEEEEEFF")
## groups = NULL
## engine = ggplot2
## verbose = FALSE
The grouped plot highlights dominant rule families by clustering rules with shared antecedents.
Large groups are centered around screen-related, design-related, and quality-related antecedents, while price-tier and brand-based groups appear less frequently. This pattern indicates that review content themes drive the association structure more strongly than static product metadata.
Higher group density around feature-based antecedents further confirms that multi-feature evaluation patterns dominate user review behavior.
# Network graph (item connections)
plot(rules_interesting,
method = "graph",
main = "Association Network",
control = list(
layout = "stress",      # default layout (kept explicit)
engine = "ggplot2", # default; keep explicit
max = 100
))
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
The association network graph reveals a highly interconnected core consisting of:
discusses_screen
discusses_design
discusses_battery
mentions_quality
has_issues
These nodes occupy central hub positions with multiple high-lift connections, indicating that they function as primary connectors in review discussion behavior.
Brand and price-tier nodes appear more peripheral, supporting the interpretation that semantic review content dominates network connectivity, while product metadata plays a secondary structural role.
# Parallel coordinates plot
plot(rules_interesting,
method = "paracoord",
main = "Parallel Coordinates - Top Rules",
control = list(reorder = TRUE))
The parallel coordinates plot highlights repeated multi-item rule pathways.
Strong rule trajectories frequently transition from battery and screen discussion toward design-related outcomes, showing consistent multi-attribute evaluation chains. High-lift rules follow similar paths across dimensions, indicating that these are not isolated patterns but recurring behavioral structures.
This confirms that review sentiment and feature discussion propagate through coherent multi-topic evaluation sequences.
Taken together, the visualization results demonstrate that:
strong associations form localized clusters centered on hardware and quality discussion themes,
review content features dominate the association structure while brand and price-tier items remain peripheral, and
high-lift rules follow recurring multi-attribute evaluation pathways.
These visual patterns directly reinforce the statistical findings and validate the interpretability and stability of the filtered rule set.
To support structured reporting, the filtered association rules are converted into a summary table containing the rule text (LHS => RHS), support, confidence, lift, and transaction count.
The table is sorted by lift to prioritize the strongest non-random associations.
# Create comprehensive summary table
rules_summary <- data.frame(
Rule = paste(
arules::labels(lhs(rules_interesting)),
"=>",
arules::labels(rhs(rules_interesting))
),
Support = quality(rules_interesting)$support,
Confidence = quality(rules_interesting)$confidence,
Lift = quality(rules_interesting)$lift,
Count = quality(rules_interesting)$count
)
# Sort by lift and show top rules
rules_summary_sorted <- rules_summary[order(-rules_summary$Lift), ]
head(rules_summary_sorted, 20)
The consolidated rule table presents the top 20 rules ranked by lift, highlighting the strongest structural relationships after redundancy removal and quality filtering.
The highest-lift rules are dominated by combinations of battery, design, screen, and quality-related discussion themes, such as:
{discusses_screen, mentions_quality} → {discusses_design}
{discusses_design, mentions_quality} → {discusses_screen}
{discusses_battery, mentions_quality} → {discusses_screen}
These rules consistently achieve lift values above 2.5, indicating strong non-random associations. When reviewers discuss multiple hardware-related attributes together, the probability of also discussing the remaining hardware themes increases roughly two-and-a-half to three-fold relative to random expectation.
Several reciprocal rule structures appear in the top rankings:
{discusses_screen} → {discusses_design}
{discusses_design} → {discusses_screen}
Both directions share an identical lift of about 2.58, with confidence values of roughly 0.48 and 0.43 respectively, demonstrating that screen and design evaluation are tightly coupled in review narratives. This reflects consistent multi-attribute assessment behavior rather than isolated feature commentary.
To quantify the thematic composition of the filtered rule set, a diagnostic count is performed based on rule antecedents.
cat("\n=== KEY INSIGHTS ===\n")
##
## === KEY INSIGHTS ===
#LHS as character labels
lhs_txt <- labels(lhs(rules_interesting))
# Insight 1: Brand patterns
brand_rules <- rules_interesting[grepl("Brand_", lhs_txt)]
cat("Brand-related rules:", length(brand_rules), "\n")
## Brand-related rules: 13
# Insight 2: Feature discussions
feature_rules <- rules_interesting[grepl("mentions_|discusses_|has_|comments_", lhs_txt)]
cat("Feature discussion rules:", length(feature_rules), "\n")
## Feature discussion rules: 47
# Insight 3: Price sensitivity
price_rules <- rules_interesting[grepl("Price_", lhs_txt)]
cat("Price-related rules:", length(price_rules), "\n")
## Price-related rules: 16
The diagnostic results reveal a clear hierarchy in the structural drivers of review behavior.
A total of 47 rules are driven by review content features such as battery performance, screen quality, design attributes, price-value perception, and issue reporting. This confirms that user evaluation behavior is primarily structured around experiential product characteristics rather than static metadata.
Reviewers consistently combine multiple technical and usability dimensions when forming opinions. Screen quality, design, and battery performance appear repeatedly across high-lift and high-confidence rules, indicating stable multi-attribute evaluation clusters.
The analysis identifies 16 price-related rules, linking price tiers with discussion themes and rating outcomes. While less dominant than feature-driven patterns, pricing still shapes evaluation behavior.
Premium and ultra-priced products show stronger associations with design and screen discussion, reflecting elevated performance expectations. Budget-tier products exhibit more heterogeneous evaluation patterns.
Only 13 brand-related rules appear among the filtered high-quality rule set. Compared to feature and price dimensions, brand effects play a weaker structural role.
This suggests that while brand influences purchasing decisions, review narratives are primarily driven by hands-on experience and functional performance rather than brand identity alone.
Overall, the association rule structure reflects a clear hierarchy:
Product features and performance dominate
Price tier moderates evaluation behavior
Brand identity contributes secondary influence
This pattern aligns with realistic consumer decision-making dynamics in online marketplaces, where satisfaction and dissatisfaction are driven primarily by experiential quality rather than marketing signals.
The extracted association rules provide actionable insight for product managers, marketing teams, and e-commerce platforms. By analyzing high-confidence and high-lift patterns linked to positive and negative review outcomes, targeted operational and strategic recommendations can be formulated.
cat("\n=== BUSINESS RECOMMENDATIONS ===\n\n")
##
## === BUSINESS RECOMMENDATIONS ===
# Recommendation 1
if(length(rules_excellent) > 0) {
cat("1. To get Excellent reviews:\n")
top_excellent <- head(sort(rules_excellent, by = "lift"), 3)
inspect(top_excellent)
}
## 1. To get Excellent reviews:
## lhs rhs support confidence coverage lift count itemset
## [1] {Brand_Samsung,
## mentions_quality,
## Price_ultra} => {Rating_excellent} 0.07558109 0.7457449 0.1013498 1.330224 4250 NA
## [2] {mentions_quality,
## Price_ultra} => {Rating_excellent} 0.12464655 0.7255694 0.1717914 1.294236 7009 NA
## [3] {Brand_Samsung,
## mentions_quality} => {Rating_excellent} 0.16487347 0.7151894 0.2305312 1.275721 9271 NA
# Recommendation 2
if(length(rules_poor) > 0) {
cat("\n2. To avoid poor reviews, watch out for:\n")
top_poor <- head(sort(rules_poor, by = "lift"), 3)
inspect(top_poor)
}
##
## 2. To avoid poor reviews, watch out for:
## lhs rhs support confidence coverage lift count
## [1] {has_issues} => {Rating_poor} 0.077644 0.4684549 0.1657449 1.938599 4366
## itemset
## [1] NA
# Recommendation 3
cat("\n3. Most common patterns (by support):\n")
##
## 3. Most common patterns (by support):
top_support <- head(sort(rules_interesting, by = "support"), 5)
inspect(top_support)
## lhs rhs support confidence coverage
## [1] {Rating_excellent} => {mentions_quality} 0.3724458 0.6643510 0.5606160
## [2] {mentions_quality} => {Rating_excellent} 0.3724458 0.6863857 0.5426188
## [3] {Price_premium} => {Rating_excellent} 0.2476926 0.5650763 0.4383347
## [4] {Rating_excellent} => {Price_premium} 0.2476926 0.4418221 0.5606160
## [5] {Price_premium} => {mentions_quality} 0.2400989 0.5477524 0.4383347
## lift count itemset
## [1] 1.224342 20943 NA
## [2] 1.224342 20943 NA
## [3] 1.007956 13928 NA
## [4] 1.007956 13928 NA
## [5] 1.009461 13501 NA
The strongest rules predicting excellent ratings reveal consistent patterns involving quality mentions, premium price tiers, and brand-feature combinations, such as:
{Brand_Samsung, mentions_quality, Price_ultra} → {Rating_excellent}
{mentions_quality, Price_ultra} → {Rating_excellent}
{Brand_Samsung, mentions_quality} → {Rating_excellent}
These rules exhibit confidence values above 70% and lift values exceeding 1.27, indicating significantly higher-than-random likelihood of excellent ratings.
Recommendation:
Manufacturers and sellers should emphasize perceived quality attributes
such as build reliability, display performance, and material finish,
particularly for premium and ultra-priced devices. Marketing
communication should highlight verified performance benchmarks and
quality assurance signals.
Premium pricing strategies should be paired with tangible product differentiation. Customers paying higher prices consistently expect superior performance, and unmet expectations increase dissatisfaction risk.
The strongest negative outcome pattern is:
{has_issues} → {Rating_poor}
This rule exhibits a lift close to 2, indicating that reported problems nearly double the probability of poor ratings.
Recommendation:
Organizations should prioritize early-stage defect detection and quality
control, particularly in logistics handling, battery reliability, and
hardware durability. Automated review monitoring systems can be deployed
to flag recurring issue patterns in real time.
Post-purchase support processes should also be optimized. Faster warranty handling and responsive customer service can reduce dissatisfaction escalation and review-based reputation damage.
High-support rules reveal stable global trends such as:
{Rating_excellent} → {mentions_quality}
{Price_premium} → {mentions_quality}
{Price_premium} → {Rating_excellent}
With support values reaching 37%, these patterns represent dominant population-level behavior rather than niche effects.
Recommendation:
E-commerce platforms can integrate these insights into recommendation
engines and review summarization systems. Highlighting quality-related
excerpts and feature performance summaries for premium products can
strengthen perceived value and improve conversion rates.
Dynamic product badges such as “Highly Rated for Quality” or “Premium Performance Verified” can be algorithmically generated using rule-driven signals to guide consumer decision-making.
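As an illustration of how such rule-driven signals could be operationalized, the hedged sketch below (hypothetical item set and helper logic; not part of the original notebook) checks which filtered rules fire for a single incoming review:
# Hypothetical example: which mined rules apply to one new review's item set?
new_review_items <- c("Brand_Samsung", "Price_ultra", "mentions_quality")  # illustrative only
# A rule fires when every item on its LHS occurs in the review's item set.
fires <- sapply(LIST(lhs(rules_interesting)),
                function(rule_lhs) all(rule_lhs %in% new_review_items))
inspect(rules_interesting[fires])
The same matching logic could be run in batch over incoming reviews to drive the badge or monitoring ideas described above.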
Overall, the association-based business insights indicate three core priorities:
reinforce perceived quality, especially for premium and ultra-priced devices,
detect and resolve product issues early to prevent poor ratings, and
leverage the dominant quality-rating patterns in recommendation and merchandising systems.
Acting on these priorities allows organizations to convert unstructured review data into structured decision intelligence that supports improved customer satisfaction, stronger brand reputation, and sustained long-term sales performance.
The results show that smartphone reviews are primarily structured around feature-based evaluation bundles rather than brand identity alone. The strongest associations form a tightly connected cluster linking screen performance, physical design, and battery behavior. When reviewers discuss one of these attributes, they are substantially more likely than expected to discuss the others, as reflected by the highest lift rules.
Positive quality-related language is strongly and frequently
associated with excellent ratings, particularly in higher price tiers
where customer expectations are more closely aligned with perceived
build quality and performance. In contrast, issue-related language
emerges as the clearest signal of dissatisfaction. The rule
{has_issues} → {Rating_poor} shows a marked increase in the
probability of poor ratings when product problems are reported.
The discovered rule structure indicates that customers evaluate smartphones as an integrated product experience rather than as isolated attributes. Hardware-related themes dominate sentiment outcomes, and multi-attribute evaluations drive both praise and criticism patterns.
These findings provide a data-driven basis for prioritizing product improvement efforts, strengthening quality assurance processes, and deploying early warning systems for negative feedback detection. The results demonstrate the value of association rule mining as a practical and interpretable framework for extracting behavioral structure from large-scale review data.