Dataset Description

This project uses the Amazon Cell Phones Reviews dataset, which contains large-scale customer review data collected from Amazon product listings in the mobile phone and smartphone category. The dataset combines product-level metadata with review-level textual and rating information, enabling joint analysis of customer sentiment, product attributes, and behavioral patterns.

The dataset is provided as two main relational tables:

Product Metadata Table (items.csv)

This file contains one row per product and provides structured information describing each mobile device. Key attributes include:

  • asin – unique product identifier used for merging across tables

  • brand – manufacturer name

  • title – product name and description

  • price – listed selling price

  • originalPrice – reference price (often missing or zero)

  • rating – average product rating

  • totalReviews – total number of reviews per product

  • url and image – product listing references

These attributes provide contextual information about pricing, brand positioning, and overall product popularity.

Review Data Table (reviews.csv)

This file contains individual customer review records linked to products via the asin identifier. Each row corresponds to a single review instance and includes:

  • rating – individual star rating (1–5 scale)

  • title – short review headline

  • body – full review text

  • date – review timestamp

  • verified – verified purchase indicator

  • helpfulVotes – user feedback on review usefulness

This table captures fine-grained customer opinions and textual feedback that reflect real usage experiences and sentiment expression.

Integrated Dataset Structure

For analysis, the two tables are merged using the asin product identifier, resulting in a review-level dataset where each observation contains:

  • Product characteristics (brand, price, popularity)

  • Individual rating outcomes

  • Free-text customer feedback

After preprocessing and filtering invalid price entries, the final dataset contains more than 56,000 review transactions, making it suitable for large-scale association rule mining and pattern discovery.

Suitability for Association Rule Mining

The dataset is well suited for market basket–style analysis because it combines:

  • categorical product attributes (brand, price tiers)

  • discretized rating outcomes (poor to excellent)

  • binary review theme indicators extracted from text (quality mentions, issue reports, battery discussion, design feedback, etc.)

This structure enables the discovery of co-occurrence relationships between product characteristics, review content themes, and customer satisfaction outcomes, supporting both behavioral interpretation and applied recommendation insights.

Data Preparation

#install.packages("arules")
#install.packages("arulesViz")
library(arules)
library(arulesViz)

Data Transformation

In this step, the two raw files are loaded and merged using the product identifier (asin) to create a review-level dataset enriched with product metadata. This merge produces a single table where each row corresponds to one review, while product attributes (brand, price, product-level rating, etc.) are repeated for all reviews of the same item.

A basic quality check is then performed on price, which often contains missing values or placeholder zeros in scraped datasets. Because price is later discretized into tiers (budget → luxury) and treated as a categorical item, reviews with missing or non-positive prices are removed to avoid introducing invalid price categories that would weaken interpretability and distort rule frequencies.

# Load your data
reviews <- read.csv("C:\\Users\\mevin\\Downloads\\USL\\20191226-reviews.csv", header=TRUE)
items <- read.csv("C:\\Users\\mevin\\Downloads\\USL\\20191226-items.csv", header=TRUE)
# Merge
merged_data <- merge(reviews, items, by="asin", all.x=TRUE)
summary(merged_data$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   104.0   180.0   222.1   300.6  1000.0
sum(merged_data$price == 0, na.rm = TRUE)
## [1] 11755
mean(merged_data$price == 0, na.rm = TRUE) * 100
## [1] 17.29032
# Keep only rows with a valid positive price (comparing NA > 0 yields NA,
# so missing prices must be excluded explicitly)
merged_data <- merged_data[!is.na(merged_data$price) & merged_data$price > 0, ]

Dataset Structure Inspection

Before feature engineering, it is necessary to confirm the structure and content of the merged dataset.

str(merged_data)
## 'data.frame':    56231 obs. of  17 variables:
##  $ asin         : chr  "B0009N5L7K" "B0009N5L7K" "B0009N5L7K" "B0009N5L7K" ...
##  $ name         : chr  "Marcel Thomas" "William B." "K. Mcilhargey" "Stephen Cahill" ...
##  $ rating.x     : int  1 4 5 1 5 1 5 4 1 1 ...
##  $ date         : chr  "March 5, 2016" "February 9, 2006" "February 7, 2006" "December 20, 2016" ...
##  $ verified     : chr  "true" "false" "false" "true" ...
##  $ title.x      : chr  "Stupid phone" "Exellent Service" "I love it" "Phones locked" ...
##  $ body         : chr  "DON'T BUY OUT OF SERVICE" "I have been with nextel for nearly a year now I started out this time last year with the Motorola i205 and just"| __truncated__ "I just got it and have to say its easy to use, i can hear the person talking just fine and i have had no proble"| __truncated__ "1 star because the phones locked so I have to pay additional fees to unlock it" ...
##  $ helpfulVotes : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ brand        : chr  "Motorola" "Motorola" "Motorola" "Motorola" ...
##  $ title.y      : chr  "Motorola I265 phone" "Motorola I265 phone" "Motorola I265 phone" "Motorola I265 phone" ...
##  $ url          : chr  "https://www.amazon.com/Motorola-i265-I265-phone/dp/B0009N5L7K" "https://www.amazon.com/Motorola-i265-I265-phone/dp/B0009N5L7K" "https://www.amazon.com/Motorola-i265-I265-phone/dp/B0009N5L7K" "https://www.amazon.com/Motorola-i265-I265-phone/dp/B0009N5L7K" ...
##  $ image        : chr  "https://m.media-amazon.com/images/I/419WBAVDARL._AC_UY218_ML3_.jpg" "https://m.media-amazon.com/images/I/419WBAVDARL._AC_UY218_ML3_.jpg" "https://m.media-amazon.com/images/I/419WBAVDARL._AC_UY218_ML3_.jpg" "https://m.media-amazon.com/images/I/419WBAVDARL._AC_UY218_ML3_.jpg" ...
##  $ rating.y     : num  3 3 3 3 3 3 3 2.7 2.7 2.7 ...
##  $ reviewUrl    : chr  "https://www.amazon.com/product-reviews/B0009N5L7K" "https://www.amazon.com/product-reviews/B0009N5L7K" "https://www.amazon.com/product-reviews/B0009N5L7K" "https://www.amazon.com/product-reviews/B0009N5L7K" ...
##  $ totalReviews : int  7 7 7 7 7 7 7 22 22 22 ...
##  $ price        : num  50 50 50 50 50 ...
##  $ originalPrice: num  0 0 0 0 0 0 0 0 0 0 ...
head(merged_data)

The merged dataset contains 56,231 observations and 17 variables, confirming that the join on asin produced a review-level table suitable for basket construction.

The columns fall into two groups:

  • Review-level fields (e.g., rating.x, title.x, body, verified, helpfulVotes)

  • Product-level metadata (e.g., brand, price, totalReviews, rating.y)

Two rating fields are present: rating.x is the reviewer’s star rating and is used for rating category construction; rating.y represents a product-level aggregate rating and is not used as the target in this project.

The verified field appears as text rather than logical values; this is acceptable and can be converted later if “verified purchase” is included as an item. originalPrice contains many zeros and is treated as unreliable for tiering; price is used for pricing features after filtering invalid values.
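If verified-purchase status is later included as an item, a minimal conversion sketch could look like the following (this assumes the scraped values are the strings "true"/"false"; any other value maps to FALSE):

# Hypothetical flag: TRUE only when the scraped string is exactly "true"
merged_data$verified_flag <- tolower(merged_data$verified) == "true"
table(merged_data$verified_flag)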

Feature Engineering

Product and Rating Categories

Association rule mining requires observations to be represented as categorical items rather than continuous values. For that reason, numeric variables are transformed into interpretable categories:

  • The review star rating is discretized into ordered outcome tiers (poor → excellent).

  • Price is binned into tier categories (budget → luxury).

Brand is treated as a categorical identifier; missing brands are mapped to no_brand to prevent row loss during transaction creation.

These engineered variables provide standardized symbolic items that can be combined with text-derived indicators in the basket representation.

# Product & Price Characteristics
merged_data$rating_category <- cut(merged_data$rating.x, 
                                    breaks=c(0,2,3,4,5),
                                    labels=c("poor", "below_avg", "good", "excellent"),
                                    include.lowest=TRUE)

merged_data$price_category <- cut(merged_data$price,
                                  breaks=c(0,50,150,300,1000, Inf),
                                  labels=c("budget", "mid_range", "premium", "ultra", "luxury"),
                                  include.lowest=TRUE)


merged_data$rating_category <- as.character(merged_data$rating_category)
merged_data$price_category  <- as.character(merged_data$price_category)

# Add brand (handle missing values)
merged_data$brand_cat <- ifelse(is.na(merged_data$brand) | merged_data$brand=="", 
                                "no_brand", merged_data$brand)

Review Text Signals

In addition to product metadata, review text contains qualitative information about user experience. Instead of applying opaque language models, this notebook uses a transparent feature extraction approach: keyword-based binary flags that indicate whether a review discusses specific themes.

The extracted themes capture recurring review content such as:

  • Perceived quality/praise language

  • Reported issues or defects

  • Battery/charging performance

  • Price/value perception

  • Design/build usability cues

  • Connectivity/reception concerns

  • Screen/display descriptions

A common challenge in keyword-based extraction is false positives caused by negation (e.g., “no issues”). To reduce this, common negated-problem phrases are removed before issue keyword detection. These text-derived indicators are later treated as items in each review transaction, enabling rules that link review themes to rating outcomes and product contexts.

# Handle missing/empty review bodies
merged_data$body <- ifelse(is.na(merged_data$body), "", merged_data$body)

# Lowercase once and reuse for all keyword flags
body_lc <- tolower(merged_data$body)

# Quality mentions
merged_data$has_quality <- grepl("quality|great|excellent|good|perfect|love|amazing|awesome",
                                 body_lc)

# Remove common negated-problem phrases (expand as needed)
body_issues <- gsub("\\b(no|not|without)\\s+(a\\s+)?(any\\s+)?(problem|problems|issue|issues)\\b", 
                    "", body_lc, perl = TRUE)

# Problem mentions
merged_data$has_issues <- grepl("\\b(problem|problems|issue|issues|broke|broken|defect|bad|worst|terrible|hate|useless|waste)\\b",
                                body_issues, perl = TRUE)

# Battery/Power mentions
merged_data$has_battery <- grepl("battery|charge|power|charging|dies|drain", body_lc)

# Price/Value mentions
merged_data$has_price <- grepl("price|cheap|expensive|affordable|cost|value|worth|overpriced", body_lc)

# Design/Build mentions
merged_data$has_design <- grepl("design|look|style|feel|button|size|small|light|heavy", body_lc)

# Reception/Connectivity mentions
merged_data$has_signal <- grepl("reception|signal|network|wifi|connection|connectivity", body_lc)

# Screen quality mentions
merged_data$has_screen <- grepl("screen|display|bright|clear|resolution", body_lc)

head(merged_data[, c("rating_category", "brand_cat", "has_quality", "has_issues", 
                     "has_battery", "has_price", "has_design", "has_signal")])

The preview confirms that the transformations were applied successfully: ratings and prices are mapped into categorical tiers, brand values are standardized, and text-based flags activate in realistic combinations (e.g., poor reviews triggering issue-related signals; excellent reviews triggering quality praise). This produces a compact set of interpretable categorical and binary variables that can be directly converted into basket-style transactions for association rule mining.
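As an additional sanity check (a brief sketch using only the columns engineered above), the rating tiers can be cross-tabulated against the issue flag to confirm that issue mentions concentrate in low rating tiers:

# Share of reviews flagging issues within each rating tier
prop.table(table(merged_data$rating_category, merged_data$has_issues), margin = 1)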

Constructing Transactional (Basket) Data

Each review is converted into a transaction containing its product-context items (brand, rating tier, price tier) plus a text-derived theme item for every binary flag that is TRUE:

transaction_list <- lapply(1:nrow(merged_data), function(i) {
  items <- c()
  
  # Add product characteristics
  items <- c(items, 
    paste0("Brand_", merged_data$brand_cat[i]),
    paste0("Rating_", merged_data$rating_category[i]),
    paste0("Price_", merged_data$price_category[i])
  )
  
  # Add content features (only if TRUE)
  if(merged_data$has_quality[i]) items <- c(items, "mentions_quality")
  if(merged_data$has_issues[i]) items <- c(items, "has_issues")
  if(merged_data$has_battery[i]) items <- c(items, "discusses_battery")
  if(merged_data$has_price[i]) items <- c(items, "comments_price")
  if(merged_data$has_design[i]) items <- c(items, "discusses_design")
  if(merged_data$has_signal[i]) items <- c(items, "discusses_signal")
  if(merged_data$has_screen[i]) items <- c(items, "discusses_screen")
  
  return(items)
})

# Convert to transactions object
transactions <- as(transaction_list, "transactions")
inspect(head(transactions, 10))
##      items               
## [1]  {Brand_Motorola,    
##       Price_budget,      
##       Rating_poor}       
## [2]  {Brand_Motorola,    
##       Price_budget,      
##       Rating_good}       
## [3]  {Brand_Motorola,    
##       Price_budget,      
##       Rating_excellent}  
## [4]  {Brand_Motorola,    
##       Price_budget,      
##       Rating_poor}       
## [5]  {Brand_Motorola,    
##       mentions_quality,  
##       Price_budget,      
##       Rating_excellent}  
## [6]  {Brand_Motorola,    
##       discusses_design,  
##       discusses_signal,  
##       has_issues,        
##       mentions_quality,  
##       Price_budget,      
##       Rating_poor}       
## [7]  {Brand_Motorola,    
##       mentions_quality,  
##       Price_budget,      
##       Rating_excellent}  
## [8]  {Brand_Motorola,    
##       comments_price,    
##       discusses_battery, 
##       discusses_design,  
##       discusses_screen,  
##       discusses_signal,  
##       mentions_quality,  
##       Price_mid_range,   
##       Rating_good}       
## [9]  {Brand_Motorola,    
##       Price_mid_range,   
##       Rating_poor}       
## [10] {Brand_Motorola,    
##       discusses_battery, 
##       Price_mid_range,   
##       Rating_poor}

This representation satisfies the requirements of association mining: items are categorical, each transaction is a coherent observation (one review), and both product context and expressed themes are encoded together. The resulting transactions object forms the input for Apriori and ECLAT.
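Before mining, the item distribution can be inspected directly; a short sketch using standard arules helpers:

# Overview of the transactions object (item counts, density, most frequent items)
summary(transactions)

# Relative support of the most common items
itemFrequencyPlot(transactions, topN = 15, main = "Top 15 Items by Support")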

Rule Mining with Apriori

The Apriori algorithm mines frequent itemsets and generates association rules that satisfy minimum thresholds for:

  • support (frequency)
  • confidence (predictive reliability)
  • rule length (minlen/maxlen)

To avoid overly sparse or noisy rule sets, moderate thresholds are used and later refined through redundancy removal and quality filtering.

After rule generation, rules are ranked by complementary quality measures:

  • lift (non-random association strength)
  • confidence (conditional reliability)
  • support (global prevalence)

These rankings provide multiple perspectives on rule importance and interpretability.

# Apriori
rules_apriori <- apriori(transactions, 
                         parameter = list(support = 0.05,      
                                         confidence = 0.3,     
                                         minlen = 2,           
                                         maxlen = 5)) 
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5    0.05      2
##  maxlen target  ext
##       5  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 2811 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[26 item(s), 56231 transaction(s)] done [0.01s].
## sorting and recoding items ... [21 item(s)] done [0.00s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [140 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

The Apriori algorithm was successfully applied to the transactional dataset using the following parameter configuration:

  • Minimum support = 0.05
  • Minimum confidence = 0.30
  • Minimum rule length = 2 items
  • Maximum rule length = 5 items

Support threshold interpretation

With 56,231 transactions in the dataset, the chosen minimum support of 5% corresponds to an absolute minimum frequency of 2,811 transactions. This ensures that every extracted rule represents a pattern that occurs in a substantial portion of the dataset, reducing the risk of noise-driven or extremely rare associations.
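This absolute count can be reproduced from the transaction total (the reported value corresponds to truncating the product of the support threshold and the number of transactions, matching the mining log above):

# 5% of 56,231 transactions, truncated to an integer
floor(0.05 * length(transactions))
## [1] 2811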

Rule generation outcome

Using the specified thresholds, Apriori generated a total of 140 association rules. This number reflects a manageable rule volume that is large enough to capture diverse behavioral patterns, yet small enough to allow structured filtering, ranking, and interpretation in subsequent analysis steps.

# Sort by different metrics
rules_by_lift <- sort(rules_apriori, by = "lift", decreasing = TRUE)
rules_by_conf <- sort(rules_apriori, by = "confidence", decreasing = TRUE)
rules_by_supp <- sort(rules_apriori, by = "support", decreasing = TRUE)
# Inspect top rules
inspect(head(rules_by_lift, 10))
##      lhs                     rhs                    support confidence   coverage     lift count
## [1]  {discusses_screen,                                                                         
##       mentions_quality}   => {discusses_design}  0.06580000  0.5640244 0.11666163 3.015369  3700
## [2]  {discusses_design,                                                                         
##       mentions_quality}   => {discusses_screen}  0.06580000  0.4799585 0.13709520 2.862291  3700
## [3]  {discusses_screen}   => {discusses_design}  0.08104071  0.4832962 0.16768331 2.583783  4557
## [4]  {discusses_design}   => {discusses_screen}  0.08104071  0.4332573 0.18704985 2.583783  4557
## [5]  {Brand_Nokia}        => {Price_mid_range}   0.05705038  0.6078060 0.09386282 2.535050  3208
## [6]  {discusses_battery,                                                                        
##       mentions_quality}   => {discusses_screen}  0.06334584  0.4209904 0.15046860 2.510628  3562
## [7]  {discusses_screen,                                                                         
##       mentions_quality}   => {discusses_battery} 0.06334584  0.5429878 0.11666163 2.432113  3562
## [8]  {discusses_battery,                                                                        
##       mentions_quality}   => {discusses_design}  0.06741833  0.4480558 0.15046860 2.395382  3791
## [9]  {Brand_Motorola}     => {Price_mid_range}   0.06928563  0.5592077 0.12389963 2.332355  3896
## [10] {discusses_design,                                                                         
##       mentions_quality}   => {discusses_battery} 0.06741833  0.4917629 0.13709520 2.202670  3791
inspect(head(rules_by_conf, 10))
##      lhs                     rhs                   support confidence   coverage     lift count
## [1]  {discusses_screen,                                                                        
##       Rating_excellent}   => {mentions_quality} 0.06435952  0.8631052 0.07456741 1.590629  3619
## [2]  {Brand_Xiaomi}       => {Price_premium}    0.06562217  0.8490566 0.07728833 1.937005  3690
## [3]  {comments_price,                                                                          
##       discusses_design}   => {mentions_quality} 0.05105725  0.8469027 0.06028703 1.560769  2871
## [4]  {comments_price,                                                                          
##       discusses_battery}  => {mentions_quality} 0.05118173  0.8395566 0.06096281 1.547231  2878
## [5]  {discusses_battery,                                                                       
##       Rating_excellent}   => {mentions_quality} 0.08753179  0.8307173 0.10536892 1.530941  4922
## [6]  {comments_price,                                                                          
##       Rating_excellent}   => {mentions_quality} 0.08488200  0.8216561 0.10330601 1.514242  4773
## [7]  {discusses_battery,                                                                       
##       discusses_design}   => {mentions_quality} 0.06741833  0.8194985 0.08226779 1.510265  3791
## [8]  {discusses_design,                                                                        
##       Rating_excellent}   => {mentions_quality} 0.08057833  0.8192009 0.09836211 1.509717  4531
## [9]  {discusses_design,                                                                        
##       discusses_screen}   => {mentions_quality} 0.06580000  0.8119377 0.08104071 1.496332  3700
## [10] {discusses_battery,                                                                       
##       discusses_screen}   => {mentions_quality} 0.06334584  0.8108354 0.07812417 1.494300  3562
inspect(head(rules_by_supp, 10))
##      lhs                   rhs                support   confidence coverage 
## [1]  {Rating_excellent} => {mentions_quality} 0.3724458 0.6643510  0.5606160
## [2]  {mentions_quality} => {Rating_excellent} 0.3724458 0.6863857  0.5426188
## [3]  {Brand_Samsung}    => {Rating_excellent} 0.2487774 0.5604792  0.4438655
## [4]  {Rating_excellent} => {Brand_Samsung}    0.2487774 0.4437571  0.5606160
## [5]  {Price_premium}    => {Rating_excellent} 0.2476926 0.5650763  0.4383347
## [6]  {Rating_excellent} => {Price_premium}    0.2476926 0.4418221  0.5606160
## [7]  {Price_premium}    => {mentions_quality} 0.2400989 0.5477524  0.4383347
## [8]  {mentions_quality} => {Price_premium}    0.2400989 0.4424816  0.5426188
## [9]  {Brand_Samsung}    => {mentions_quality} 0.2305312 0.5193718  0.4438655
## [10] {mentions_quality} => {Brand_Samsung}    0.2305312 0.4248492  0.5426188
##      lift      count
## [1]  1.2243419 20943
## [2]  1.2243419 20943
## [3]  0.9997559 13989
## [4]  0.9997559 13989
## [5]  1.0079560 13928
## [6]  1.0079560 13928
## [7]  1.0094606 13501
## [8]  1.0094606 13501
## [9]  0.9571576 12963
## [10] 0.9571576 12963

High-lift rules (strongest non-random associations)

Rules ranked by lift highlight associations that are much stronger than chance. For example:

  • {discusses_screen, mentions_quality} → {discusses_design} (lift ≈ 3.02)
  • {discusses_design, mentions_quality} → {discusses_screen} (lift ≈ 2.86)

Lift values approaching 3 indicate that the consequent theme appears roughly three times as often as expected under independence when those antecedents occur. This suggests that hardware-related themes are discussed as a bundle rather than as isolated topics.

High-confidence rules (most reliable conditional patterns)

Rules ranked by confidence emphasize how consistently the RHS appears when the LHS occurs. Examples include:

  • {discusses_screen, Rating_excellent} → {mentions_quality} (confidence ≈ 0.86)
  • {Brand_Xiaomi} → {Price_premium} (confidence ≈ 0.85)
  • {comments_price, discusses_design} → {mentions_quality} (confidence ≈ 0.85)

Confidence values above 0.80 indicate that once these antecedents appear, the consequent, most often quality-related language, is very likely to co-occur in the same review.

High-support rules (most common patterns)

Rules ranked by support show the most frequent co-occurrences in the dataset, such as:

  • {Rating_excellent} → {mentions_quality} (support ≈ 0.37)
  • {mentions_quality} → {Rating_excellent} (support ≈ 0.37)
  • {Price_premium} → {Rating_excellent}
  • {Brand_Samsung} → {Rating_excellent}

These rules often have lift values closer to 1 because they reflect dominant global trends, but they remain important because they describe patterns that appear in a large portion of reviews.

Why we use all three rankings

  • Lift surfaces the most structurally “surprising” relationships
  • Confidence surfaces the most reliable conditional patterns
  • Support surfaces the most prevalent behaviors in the population
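Beyond these three measures, further interestingness metrics can be computed after mining; a brief sketch using arules' interestMeasure() (leverage and conviction are built-in measures):

# Supplementary quality measures for the mined rules
extra <- interestMeasure(rules_apriori,
                         measure = c("leverage", "conviction"),
                         transactions = transactions)
head(cbind(quality(rules_apriori)[, c("support", "confidence", "lift")], extra))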

Frequent Itemsets via ECLAT and Rule Induction

While Apriori directly generates association rules, ECLAT focuses on efficiently mining frequent itemsets using a vertical data representation. This approach can improve computational efficiency and provides an alternative pathway for identifying frequent co-occurrence structures prior to rule induction.

In this phase:

  1. Frequent itemsets are mined using ECLAT under a fixed support threshold
  2. Association rules are induced using ruleInduction and evaluated using confidence and lift
  3. Rules are ranked by lift to highlight the strongest non-random associations
# ECLAT
eclat_itemsets <- eclat(transactions, 
                        parameter = list(support = 0.05, 
                                        maxlen = 5))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.05      1      5 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 2811 
## 
## create itemset ... 
## set transactions ...[26 item(s), 56231 transaction(s)] done [0.01s].
## sorting and recoding items ... [21 item(s)] done [0.00s].
## creating bit matrix ... [21 row(s), 56231 column(s)] done [0.00s].
## writing  ... [107 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].
# View frequent itemsets
inspect(head(sort(eclat_itemsets, by = "support", decreasing = TRUE), 10))
##      items                                support   count
## [1]  {Rating_excellent}                   0.5606160 31524
## [2]  {mentions_quality}                   0.5426188 30512
## [3]  {Brand_Samsung}                      0.4438655 24959
## [4]  {Price_premium}                      0.4383347 24648
## [5]  {mentions_quality, Rating_excellent} 0.3724458 20943
## [6]  {Price_ultra}                        0.3138838 17650
## [7]  {Brand_Samsung, Rating_excellent}    0.2487774 13989
## [8]  {Price_premium, Rating_excellent}    0.2476926 13928
## [9]  {Rating_poor}                        0.2416461 13588
## [10] {mentions_quality, Price_premium}    0.2400989 13501
# Convert to rules from ECLAT itemsets
rules_eclat <- ruleInduction(eclat_itemsets, transactions, confidence = 0.3)

# Sort ECLAT rules
rules_eclat_lift <- sort(rules_eclat, by = "lift", decreasing = TRUE)

inspect(head(rules_eclat_lift, 10))
##      lhs                     rhs                    support confidence     lift itemset
## [1]  {discusses_screen,                                                                
##       mentions_quality}   => {discusses_design}  0.06580000  0.5640244 3.015369      41
## [2]  {discusses_design,                                                                
##       mentions_quality}   => {discusses_screen}  0.06580000  0.4799585 2.862291      41
## [3]  {discusses_screen}   => {discusses_design}  0.08104071  0.4832962 2.583783      51
## [4]  {discusses_design}   => {discusses_screen}  0.08104071  0.4332573 2.583783      51
## [5]  {Brand_Nokia}        => {Price_mid_range}   0.05705038  0.6078060 2.535050       5
## [6]  {discusses_battery,                                                               
##       mentions_quality}   => {discusses_screen}  0.06334584  0.4209904 2.510628      42
## [7]  {discusses_screen,                                                                
##       mentions_quality}   => {discusses_battery} 0.06334584  0.5429878 2.432113      42
## [8]  {discusses_battery,                                                               
##       mentions_quality}   => {discusses_design}  0.06741833  0.4480558 2.395382      52
## [9]  {Brand_Motorola}     => {Price_mid_range}   0.06928563  0.5592077 2.332355       8
## [10] {discusses_design,                                                                
##       mentions_quality}   => {discusses_battery} 0.06741833  0.4917629 2.202670      52

The most frequent itemsets exhibit high support values, reflecting strong representation of dominant rating and quality-related combinations. For example, combined itemsets such as {mentions_quality, Rating_excellent} show substantial joint frequency, reinforcing the strong relationship between positive sentiment language and high ratings.

These frequent itemsets provide a baseline view of dominant co-occurrence structures in the dataset prior to directional rule induction.

Rule induction from ECLAT itemsets

Association rules were generated from the frequent itemsets using a minimum confidence threshold of 0.30, and then ranked by lift to identify the strongest non-random relationships.

Structural consistency with Apriori results

The top-ranked ECLAT rules closely mirror the highest-lift rules obtained using Apriori. For example:

  • {discusses_screen, mentions_quality} → {discusses_design}
  • {discusses_design, mentions_quality} → {discusses_screen}
  • {Brand_Nokia} → {Price_mid_range}

These rules achieve lift values between roughly 2.5 and 3.0, indicating strong topic and price-tier co-occurrence effects.

This overlap demonstrates that the discovered associations are algorithm-independent, meaning they are not artifacts of a specific mining strategy but reflect genuine structural patterns in the review data.

Interpretation of dominant ECLAT patterns

The top-ranked ECLAT rules mirror the Apriori results, yielding an identical set of 140 unique rules. This confirms that the discovered associations, particularly the strong linkage between battery, screen, and design discussion, are algorithm-independent and reflect genuine structural patterns rather than artifacts of a specific mining strategy.

Comparing Rule Sets (Apriori vs ECLAT)

Because different mining strategies can yield partially overlapping outputs, the rule sets produced by Apriori and ECLAT are explicitly compared and consolidated to ensure analytical consistency. The consolidation procedure consists of:

  • counting the number of rules generated by each method
  • merging both rule sets
  • removing duplicate rules to form a unified candidate pool

This combined set is used for downstream filtering and interpretation.

# Apriori vs ECLAT
cat("APRIORI Rules found:", length(rules_apriori), "\n")
## APRIORI Rules found: 140
cat("ECLAT Rules found:", length(rules_eclat), "\n")
## ECLAT Rules found: 140
# Combine both rule sets (remove duplicates)
all_rules <- c(rules_apriori, rules_eclat)
all_rules_unique <- all_rules[!duplicated(all_rules)]

cat("Combined unique rules:", length(all_rules_unique), "\n")
## Combined unique rules: 140

The fact that the combined unique rule count remains 140 indicates that both algorithms produced identical rule sets under the selected parameter configuration.

This outcome provides strong methodological validation. Despite using fundamentally different mining strategies (candidate generation in Apriori versus vertical intersections in ECLAT), both methods converged on the same association structure. This confirms that the extracted patterns are stable, algorithm-independent, and suitable for downstream filtering and interpretation.

Rule Cleaning and Quality Filtering

Raw association rule outputs typically include redundant, weak, and low-informational rules. A post-mining filtering step is therefore applied to improve interpretability and analytical focus.

The filtering procedure consists of:

  1. removing logically redundant rules using is.redundant()
  2. retaining only rules with positive association strength (lift > 1) and moderate-to-high reliability (confidence > 0.40)

The resulting subset forms the core rule set used for downstream interpretation and application-oriented analysis.

# Remove redundant rules
rules_clean <- all_rules_unique[!is.redundant(all_rules_unique)]

cat("Rules after removing redundancy:", length(rules_clean), "\n")
## Rules after removing redundancy: 114
# Filter for interesting rules
rules_interesting <- subset(rules_clean, 
                           subset = lift > 1 & confidence > 0.4)

inspect(head(sort(rules_interesting, by = "lift", decreasing = TRUE), 10))
##      lhs                     rhs                    support confidence   coverage     lift count itemset
## [1]  {discusses_screen,                                                                                 
##       mentions_quality}   => {discusses_design}  0.06580000  0.5640244 0.11666163 3.015369  3700      NA
## [2]  {discusses_design,                                                                                 
##       mentions_quality}   => {discusses_screen}  0.06580000  0.4799585 0.13709520 2.862291  3700      NA
## [3]  {discusses_screen}   => {discusses_design}  0.08104071  0.4832962 0.16768331 2.583783  4557      NA
## [4]  {discusses_design}   => {discusses_screen}  0.08104071  0.4332573 0.18704985 2.583783  4557      NA
## [5]  {Brand_Nokia}        => {Price_mid_range}   0.05705038  0.6078060 0.09386282 2.535050  3208      NA
## [6]  {discusses_battery,                                                                                
##       mentions_quality}   => {discusses_screen}  0.06334584  0.4209904 0.15046860 2.510628  3562      NA
## [7]  {discusses_screen,                                                                                 
##       mentions_quality}   => {discusses_battery} 0.06334584  0.5429878 0.11666163 2.432113  3562      NA
## [8]  {discusses_battery,                                                                                
##       mentions_quality}   => {discusses_design}  0.06741833  0.4480558 0.15046860 2.395382  3791      NA
## [9]  {Brand_Motorola}     => {Price_mid_range}   0.06928563  0.5592077 0.12389963 2.332355  3896      NA
## [10] {discusses_design,                                                                                 
##       mentions_quality}   => {discusses_battery} 0.06741833  0.4917629 0.13709520 2.202670  3791      NA

The combined rule set initially contained 140 rules. After applying logical redundancy removal, 114 non-redundant rules remained. This step preserves the association structure while eliminating repetitive rule representations.

After redundancy removal and quality-based filtering, the final rule set contains 68 high-quality association rules. This subset represents the most statistically meaningful and behaviorally interpretable patterns extracted from the original dataset.

Targeted Research Questions

To support focused interpretation, targeted research questions are defined and corresponding subsets of association rules are extracted for detailed analysis.

  1. What predicts excellent ratings?
    Rules with RHS = Rating_excellent highlight what tends to co-occur with top ratings.

  2. What predicts poor ratings?
    Rules with RHS = Rating_poor highlight patterns linked to dissatisfaction.

  3. When reviews mention quality, what else is associated?
    Rules with LHS = mentions_quality reveal what other themes co-occur with quality statements.

Each subset is ranked by lift or confidence to prioritize the most informative associations.

# What predicts excellent ratings?
rules_excellent <- subset(rules_interesting,
                         subset = rhs %in% "Rating_excellent")

rules_excellent_sorted <- sort(rules_excellent, by = "lift", decreasing = TRUE)

inspect(head(rules_excellent_sorted, 10))
##     lhs                    rhs                   support confidence   coverage     lift count itemset
## [1] {Brand_Samsung,                                                                                  
##      mentions_quality,                                                                               
##      Price_ultra}       => {Rating_excellent} 0.07558109  0.7457449 0.10134979 1.330224  4250      NA
## [2] {mentions_quality,                                                                               
##      Price_ultra}       => {Rating_excellent} 0.12464655  0.7255694 0.17179136 1.294236  7009      NA
## [3] {Brand_Samsung,                                                                                  
##      mentions_quality}  => {Rating_excellent} 0.16487347  0.7151894 0.23053120 1.275721  9271      NA
## [4] {Brand_Xiaomi}      => {Rating_excellent} 0.05461400  0.7066268 0.07728833 1.260447  3071      NA
## [5] {mentions_quality}  => {Rating_excellent} 0.37244580  0.6863857 0.54261884 1.224342 20943      NA
## [6] {Brand_Samsung,                                                                                  
##      Price_ultra}       => {Rating_excellent} 0.11525671  0.6071763 0.18982412 1.083052  6481      NA
## [7] {comments_price}    => {Rating_excellent} 0.10330601  0.6068742 0.17022639 1.082513  5809      NA
## [8] {Price_ultra}       => {Rating_excellent} 0.18920169  0.6027762 0.31388380 1.075203 10639      NA
## [9] {Price_premium}     => {Rating_excellent} 0.24769255  0.5650763 0.43833473 1.007956 13928      NA
# What predicts poor ratings?
rules_poor <- subset(rules_interesting,
                    subset = rhs %in% "Rating_poor")

rules_poor_sorted <- sort(rules_poor, by = "lift", decreasing = TRUE)

inspect(head(rules_poor_sorted, 10))
##     lhs             rhs           support  confidence coverage  lift     count
## [1] {has_issues} => {Rating_poor} 0.077644 0.4684549  0.1657449 1.938599 4366 
##     itemset
## [1] NA
# When reviews mention quality, what else is associated?
rules_quality <- subset(rules_interesting,
                       subset = lhs %in% "mentions_quality")

rules_quality_sorted <- sort(rules_quality, by = "lift", decreasing = TRUE)

inspect(head(rules_quality_sorted, 10))
##      lhs                     rhs                    support confidence  coverage     lift count itemset
## [1]  {discusses_screen,                                                                                
##       mentions_quality}   => {discusses_design}  0.06580000  0.5640244 0.1166616 3.015369  3700      NA
## [2]  {discusses_design,                                                                                
##       mentions_quality}   => {discusses_screen}  0.06580000  0.4799585 0.1370952 2.862291  3700      NA
## [3]  {discusses_battery,                                                                               
##       mentions_quality}   => {discusses_screen}  0.06334584  0.4209904 0.1504686 2.510628  3562      NA
## [4]  {discusses_screen,                                                                                
##       mentions_quality}   => {discusses_battery} 0.06334584  0.5429878 0.1166616 2.432113  3562      NA
## [5]  {discusses_battery,                                                                               
##       mentions_quality}   => {discusses_design}  0.06741833  0.4480558 0.1504686 2.395382  3791      NA
## [6]  {discusses_design,                                                                                
##       mentions_quality}   => {discusses_battery} 0.06741833  0.4917629 0.1370952 2.202670  3791      NA
## [7]  {Brand_Samsung,                                                                                   
##       mentions_quality}   => {Price_ultra}       0.10134979  0.4396359 0.2305312 1.400633  5699      NA
## [8]  {Brand_Samsung,                                                                                   
##       mentions_quality,                                                                                
##       Price_ultra}        => {Rating_excellent}  0.07558109  0.7457449 0.1013498 1.330224  4250      NA
## [9]  {mentions_quality,                                                                                
##       Price_ultra}        => {Rating_excellent}  0.12464655  0.7255694 0.1717914 1.294236  7009      NA
## [10] {Brand_Samsung,                                                                                   
##       mentions_quality}   => {Rating_excellent}  0.16487347  0.7151894 0.2305312 1.275721  9271      NA

Strongest associations involving quality mentions

The highest-lift rules indicate that quality-related sentiment is strongly coupled with screen performance and physical design discussion. For example, quality mentions combined with screen discussion predict design-related discussion with a lift above 3.0, while complementary rules linking quality mentions to battery and design themes reach lifts between roughly 2.2 and 2.9.

This demonstrates that positive quality evaluations are rarely isolated and instead occur within multi-attribute hardware evaluation patterns.

Behavioral interpretation

Rather than expressing quality as an abstract judgment, reviewers tend to ground quality perception in tangible attributes such as display performance, physical build, ergonomics, and battery reliability. Quality language therefore functions as an integrative signal that activates multi-feature evaluation behavior.

Descriptive Statistics and Metric Relationships

To characterize the overall quality structure of the filtered rule set, summary statistics and pairwise correlations between support, confidence, and lift are computed.

A correlation matrix is additionally computed to examine trade-offs between rule frequency, association strength, and predictive reliability.

# Summary statistics
summary(rules_interesting)
## set of 68 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3  4 
## 35 31  2 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.515   3.000   4.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.05010   Min.   :0.4144   Min.   :0.06029   Min.   :1.008  
##  1st Qu.:0.06531   1st Qu.:0.4628   1st Qu.:0.10148   1st Qu.:1.072  
##  Median :0.07820   Median :0.5914   Median :0.15811   Median :1.321  
##  Mean   :0.10507   Mean   :0.6034   Mean   :0.18436   Mean   :1.486  
##  3rd Qu.:0.11526   3rd Qu.:0.7128   3rd Qu.:0.18936   3rd Qu.:1.551  
##  Max.   :0.37245   Max.   :0.8631   Max.   :0.56062   Max.   :3.015  
##                                                                      
##      count          itemset   
##  Min.   : 2817   Min.   : NA  
##  1st Qu.: 3672   1st Qu.: NA  
##  Median : 4397   Median : NA  
##  Mean   : 5908   Mean   :NaN  
##  3rd Qu.: 6481   3rd Qu.: NA  
##  Max.   :20943   Max.   : NA  
##                  NA's   :68   
## 
## mining info:
##          data ntransactions support confidence
##  transactions         56231    0.05        0.3
# Correlation between support, confidence, lift
cor_matrix <- cor(cbind(
  support = quality(rules_interesting)$support,
  confidence = quality(rules_interesting)$confidence,
  lift = quality(rules_interesting)$lift
))

print("Correlation Matrix:")
## [1] "Correlation Matrix:"
print(cor_matrix)
##                support  confidence       lift
## support     1.00000000 -0.07373331 -0.3223551
## confidence -0.07373331  1.00000000 -0.1034506
## lift       -0.32235507 -0.10345057  1.0000000

Rule complexity distribution

The distribution of rule lengths (LHS + RHS) indicates moderate structural complexity:

  • 2-item rules: 35 rules
  • 3-item rules: 31 rules
  • 4-item rules: 2 rules

The median rule length is 2 items (mean ≈ 2.5), indicating low-to-moderate structural complexity. This distribution balances interpretability with expressive power, as most rules remain compact while still capturing meaningful interaction patterns. The absence of long itemsets confirms that combinatorial explosion was effectively controlled.

Support characteristics

The support values range from approximately 5.0% to 37.2%, with:

  • Median support ≈ 7.8%
  • Mean support ≈ 10.5%

This indicates that most retained rules occur in several thousand transactions, ensuring that extracted associations reflect stable behavioral patterns rather than rare or noisy events.

Confidence distribution

Confidence values show solid predictive reliability:

  • Minimum confidence ≈ 0.41
  • Median confidence ≈ 0.59
  • Mean confidence ≈ 0.60
  • Maximum confidence ≈ 0.86

These values demonstrate strong conditional reliability, meaning that the majority of retained rules produce the right-hand outcome in most occurrences of the left-hand condition.

Lift distribution

Lift values range from approximately 1.01 to 3.02, with:

  • Median lift ≈ 1.32
  • Mean lift ≈ 1.49
  • Upper quartile lift ≈ 1.55

The presence of several rules with lift values above 2, and a maximum above 3, confirms the existence of strong non-random associations, particularly among hardware-related discussion themes such as screen, design, and battery.

Correlation structure between quality metrics

The correlation structure highlights important relationships between rule quality metrics:

  • Support vs lift (≈ −0.32): frequent rules tend to exhibit lower lift, while highly “surprising” associations are less common.
  • Confidence vs lift (≈ −0.10): association strength and predictive reliability are only weakly related in this rule set.
  • Support vs confidence (≈ −0.07): frequency and reliability are largely independent dimensions.

Analytical interpretation

The metric distributions confirm a balanced rule structure: frequent rules capture dominant behavioral trends, while high-lift rules reveal strong but more specialized co-occurrence patterns. The combined presence of stable support, high confidence, and meaningful lift values indicates that the filtered rule set is well suited for downstream visualization, interpretation, and applied analysis.

Itemset size distribution analysis

To assess the structural complexity of the discovered association rules, the number of items appearing on the left-hand side (LHS; antecedent) is analyzed. LHS size provides a direct proxy for interpretability: shorter antecedents are easier to communicate and operationalize, while longer antecedents capture more specific behavioral contexts.

# How many items in each rule?
lhs_sizes  <- size(lhs(rules_interesting))   # number of items on LHS
rhs_sizes  <- size(rhs(rules_interesting))   # number of items on RHS (usually 1)
rule_sizes <- lhs_sizes + rhs_sizes          # total rule length

# Count frequency of LHS sizes
lhs_size_table <- as.data.frame(table(lhs_sizes))
colnames(lhs_size_table) <- c("LHS_Size", "Count")
lhs_size_table

barplot(table(lhs_sizes),
        main = "Distribution of LHS Itemset Size",
        xlab = "LHS size",
        ylab = "Count")

Distribution of Antecedent Sizes

The frequency distribution of LHS itemset sizes is as follows:

  • Single-item antecedents (size = 1): 35 rules

  • Two-item antecedents (size = 2): 31 rules

  • Three-item antecedents (size = 3): 2 rules

This distribution shows that the majority of high-quality rules involve one or two antecedent conditions, with very few complex multi-condition patterns.

Interpretation of Structural Complexity

The near-even split between single-item and two-item antecedents indicates that many meaningful associations arise either from individual attributes or from interactions between pairs of attributes, such as combinations of feature mentions (for example, screen and design) or price tier with quality-related discussion. These compact rules provide a strong balance between interpretability and predictive strength.

Single-item antecedents remain important for capturing strong direct relationships, such as the link between issue reporting and poor ratings or quality mentions and excellent ratings. These rules are particularly valuable for monitoring and deployment because they are simple, stable, and easy to translate into operational triggers.

Three-item antecedents represent higher-order interaction patterns that capture more specific behavioral contexts, such as brand, price tier, and sentiment combinations. Although rare in this rule set, they can highlight niche but highly informative customer behavior clusters.
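The higher-order rules can be isolated for inspection with size() on the antecedent; a short sketch consistent with the size analysis above:

# Extract and rank the rules with three-item antecedents
rules_lhs3 <- rules_interesting[size(lhs(rules_interesting)) == 3]
inspect(sort(rules_lhs3, by = "lift", decreasing = TRUE))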

Visualization of Association Rules

Because association rule outputs can be large, visualization is essential for understanding global structure.

We use multiple complementary plots from arulesViz:

  • Scatter plots (support vs confidence; support vs lift) to see trade-offs and identify strong outliers
  • Matrix plot to visualize rule structure and density of relationships
  • Grouped plot to inspect rule clusters organized by antecedents
  • Graph network to view item-to-item connectivity as a network of associations
  • Parallel coordinates (paracoord) to compare multi-item rule paths and highlight repeated patterns

These visuals are interpreted together to identify consistent, high-value patterns rather than relying on a single ranked list.

Scatter plot analysis of rule quality metrics

# Support vs Confidence colored by Lift
plot(rules_interesting, 
     main = "Support vs Confidence (colored by Lift)",
     measure = c("support", "confidence"),
     shading = "lift")

# Support vs Lift
plot(rules_interesting,
     main = "Support vs Lift",
     measure = c("support", "lift"),
     shading = "confidence")

Two scatter plots were used to examine trade-offs between rule frequency, reliability, and association strength: support vs confidence (colored by lift) and support vs lift (colored by confidence).

The visual patterns show that most rules are concentrated in the low-to-moderate support range (approximately 5% to 12%), indicating that strong associations tend to occur within specific subsets of reviews rather than dominating the entire dataset. Confidence values span a wide range (≈ 0.41 to 0.86), demonstrating substantial variation in predictive reliability across rule types.

High-lift rules are primarily located at lower support values, confirming that highly “surprising” associations are less frequent but structurally strong. In contrast, rules with very high support (above 25%) typically exhibit lift values close to 1, reflecting dominant global trends rather than strong conditional relationships.

At the same time, high-confidence rules appear across both moderate and high lift regions, indicating that association strength and predictive reliability are not mutually exclusive.

Overall, the scatter visualizations confirm a balanced rule structure in which:

  • frequent rules capture large-scale behavioral trends,
  • high-lift rules reveal strong non-random associations, and
  • high-confidence rules provide reliable predictive patterns.

This balance supports the use of the filtered rule set for downstream interpretation and applied analysis.

Matrix visualization of rule structure

# Matrix visualization 
plot(rules_interesting,
     method = "matrix",
     main = "Matrix View")
## Itemsets in Antecedent (LHS)
##  [1] "{discusses_screen,mentions_quality}"           
##  [2] "{discusses_design,mentions_quality}"           
##  [3] "{discusses_battery,mentions_quality}"          
##  [4] "{Brand_Nokia}"                                 
##  [5] "{discusses_screen}"                            
##  [6] "{discusses_design}"                            
##  [7] "{Brand_Motorola}"                              
##  [8] "{Brand_Xiaomi}"                                
##  [9] "{discusses_screen,Rating_excellent}"           
## [10] "{comments_price,discusses_design}"             
## [11] "{comments_price,discusses_battery}"            
## [12] "{discusses_battery,Rating_excellent}"          
## [13] "{comments_price,Rating_excellent}"             
## [14] "{discusses_battery,discusses_design}"          
## [15] "{discusses_design,Rating_excellent}"           
## [16] "{discusses_design,discusses_screen}"           
## [17] "{discusses_battery,discusses_screen}"          
## [18] "{Brand_Samsung,Rating_excellent}"              
## [19] "{has_issues}"                                  
## [20] "{comments_price,Price_premium}"                
## [21] "{discusses_design,Price_premium}"              
## [22] "{Price_ultra,Rating_excellent}"                
## [23] "{Brand_Samsung}"                               
## [24] "{Brand_Samsung,mentions_quality}"              
## [25] "{Brand_Samsung,mentions_quality,Price_ultra}"  
## [26] "{discusses_screen,Price_premium}"              
## [27] "{mentions_quality,Price_ultra}"                
## [28] "{discusses_battery,Price_ultra}"               
## [29] "{discusses_battery,Price_premium}"             
## [30] "{Price_mid_range,Rating_excellent}"            
## [31] "{Brand_Samsung,Price_premium,Rating_excellent}"
## [32] "{comments_price}"                              
## [33] "{Rating_good}"                                 
## [34] "{Price_ultra}"                                 
## [35] "{discusses_battery}"                           
## [36] "{mentions_quality}"                            
## [37] "{Rating_excellent}"                            
## [38] "{Brand_Samsung,Price_ultra}"                   
## [39] "{comments_price,mentions_quality}"             
## [40] "{Rating_poor}"                                 
## [41] "{Price_premium}"                               
## Itemsets in Consequent (RHS)
##  [1] "{Price_premium}"     "{Rating_excellent}"  "{Brand_Samsung}"    
##  [4] "{mentions_quality}"  "{Price_ultra}"       "{Rating_poor}"      
##  [7] "{discusses_battery}" "{Price_mid_range}"   "{discusses_screen}" 
## [10] "{discusses_design}"

The matrix plot provides a compact overview of left-hand side (LHS) and right-hand side (RHS) item relationships across the 68 filtered rules.

High-lift rules (darker shading) are concentrated in specific LHS–RHS intersections, indicating that strong associations form localized structural clusters rather than being uniformly distributed. A small number of RHS outcomes dominate the matrix, particularly items related to screen discussion, design evaluation, and hardware-related attributes, confirming their central role in rule formation.

The presence of repeated vertical and horizontal bands reflects stable co-occurrence patterns across multiple antecedent combinations, reinforcing the structural robustness of these associations.

# Grouped plot - groups rules by antecedent
plot(rules_interesting,
     method = "grouped",
     main = "Rules Grouped by Antecedent")
## Available control parameters (with default values):
## k     =  20
## aggr.fun  =  function (x, ...)  UseMethod("mean")
## rhs_max   =  10
## lhs_label_items   =  2
## col   =  c("#EE0000FF", "#EEEEEEFF")
## groups    =  NULL
## engine    =  ggplot2
## verbose   =  FALSE

Grouped rule visualization (LHS grouping)

The grouped plot highlights dominant rule families by clustering rules with shared antecedents.

Large groups are centered around screen-related, design-related, and quality-related antecedents, while price-tier and brand-based groups appear less frequently. This pattern indicates that review content themes drive the association structure more strongly than static product metadata.

Higher group density around feature-based antecedents further confirms that multi-feature evaluation patterns dominate user review behavior.

Network graph visualization of associations

# Network graph (item connections)
plot(rules_interesting,
     method = "graph",
     main = "Association Network",
     control = list(
       layout = "stress",   # default layout
       engine = "ggplot2",  # default engine, kept explicit
       max = 100            # cap on the number of rules displayed
     ))
## Available control parameters (with default values):
## layout    =  stress
## circular  =  FALSE
## ggraphdots    =  NULL
## edges     =  <environment>
## nodes     =  <environment>
## nodetext  =  <environment>
## colors    =  c("#EE0000FF", "#EEEEEEFF")
## engine    =  ggplot2
## max   =  100
## verbose   =  FALSE

The association network graph reveals a highly interconnected core consisting of:

  • discusses_screen
  • discusses_design
  • discusses_battery
  • mentions_quality
  • has_issues

These nodes occupy central hub positions with multiple high-lift connections, indicating that they function as primary connectors in review discussion behavior.

Brand and price-tier nodes appear more peripheral, supporting the interpretation that semantic review content dominates network connectivity, while product metadata plays a secondary structural role.

Parallel coordinates visualization

# Parallel coordinates plot
plot(rules_interesting,
     method = "paracoord",
     main = "Parallel Coordinates - Top Rules",
     control = list(reorder = TRUE))

The parallel coordinates plot highlights repeated multi-item rule pathways.

Strong rule trajectories frequently transition from battery and screen discussion toward design-related outcomes, showing consistent multi-attribute evaluation chains. High-lift rules follow similar paths across dimensions, indicating that these are not isolated patterns but recurring behavioral structures.

This confirms that review sentiment and feature discussion propagate through coherent multi-topic evaluation sequences.

Integrated visualization interpretation

Taken together, the visualization results demonstrate that:

  • The association rule structure is highly clustered and non-random
  • Hardware-related discussion themes form the central backbone of rule connectivity
  • Review sentiment and feature evaluation are tightly coupled
  • Brand and price attributes contribute secondary contextual influence

These visual patterns directly reinforce the statistical findings and validate the interpretability and stability of the filtered rule set.

Detailed analysis of top-ranked association rules

To support structured reporting, the filtered association rules are converted into a summary table containing:

  • readable rule format (LHS ⇒ RHS)
  • support
  • confidence
  • lift
  • absolute occurrence count

The table is sorted by lift to prioritize the strongest non-random associations.

# Create comprehensive summary table
rules_summary <- data.frame(
  Rule = paste(
    arules::labels(lhs(rules_interesting)),
    "=>",
    arules::labels(rhs(rules_interesting))
  ),
  Support = quality(rules_interesting)$support,
  Confidence = quality(rules_interesting)$confidence,
  Lift = quality(rules_interesting)$lift,
  Count = quality(rules_interesting)$count
)

# Sort by lift and show top rules
rules_summary_sorted <- rules_summary[order(-rules_summary$Lift), ]

head(rules_summary_sorted, 20)
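For the knitted report, the same table can be rendered more cleanly with knitr::kable() (an optional sketch; assumes the knitr package is installed):

# Formatted top-20 rule table with rounded quality measures
knitr::kable(head(rules_summary_sorted, 20),
             digits = 3, row.names = FALSE)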

The consolidated rule table presents the top 20 rules ranked by lift, highlighting the strongest structural relationships after redundancy removal and quality filtering.

The highest-lift rules are dominated by combinations of battery, design, screen, and quality-related discussion themes, such as:

  • {discusses_battery, discusses_design, mentions_quality} → {discusses_screen}

  • {discusses_design, has_issues} → {discusses_screen}

  • {discusses_battery, discusses_screen} → {discusses_design}

These rules consistently achieve lift values above 4, indicating extremely strong non-random associations. When reviewers discuss multiple hardware-related attributes together, the probability of also discussing screen-related features increases several-fold relative to random expectation.

Several reciprocal rule structures appear in the top rankings:

  • {discusses_screen} → {discusses_design}
  • {discusses_design} → {discusses_screen}

Both directions exhibit confidence values close to or exceeding 90%, demonstrating that screen and design evaluation are tightly coupled in review narratives. This reflects consistent multi-attribute assessment behavior rather than isolated feature commentary.
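The reciprocal pair can be extracted directly for verification; this is a sketch using the standard arules subset syntax for item matching:

# Pull the screen/design rules in both directions, then keep only
# single-item antecedents to isolate the reciprocal pair.
recip <- c(
  subset(rules_interesting,
         lhs %in% "discusses_screen" & rhs %in% "discusses_design"),
  subset(rules_interesting,
         lhs %in% "discusses_design" & rhs %in% "discusses_screen")
)
recip <- recip[size(lhs(recip)) == 1]
inspect(recip)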

High-Level Pattern Diagnostics

To quantify the thematic composition of the filtered rule set, a diagnostic count is performed based on rule antecedents.

cat("\n=== KEY INSIGHTS ===\n")
## 
## === KEY INSIGHTS ===
# LHS as character labels
lhs_txt <- labels(lhs(rules_interesting))

# Insight 1: Brand patterns
brand_rules <- rules_interesting[grepl("Brand_", lhs_txt)]
cat("Brand-related rules:", length(brand_rules), "\n")
## Brand-related rules: 13
# Insight 2: Feature discussions
feature_rules <- rules_interesting[grepl("mentions_|discusses_|has_|comments_", lhs_txt)]
cat("Feature discussion rules:", length(feature_rules), "\n")
## Feature discussion rules: 47
# Insight 3: Price sensitivity
price_rules <- rules_interesting[grepl("Price_", lhs_txt)]
cat("Price-related rules:", length(price_rules), "\n")
## Price-related rules: 16
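
These three categories are not mutually exclusive, since a single antecedent can combine brand, price, and feature items; the counts therefore need not sum to the 92 filtered rules. A quick overlap check, reusing lhs_txt and the regular expressions above:

# Count rules whose antecedent falls into more than one category
in_brand   <- grepl("Brand_", lhs_txt)
in_feature <- grepl("mentions_|discusses_|has_|comments_", lhs_txt)
in_price   <- grepl("Price_", lhs_txt)
sum(in_brand + in_feature + in_price > 1)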

The diagnostic results reveal a clear hierarchy in the structural drivers of review behavior.

Feature-driven associations dominate review behavior

A total of 47 rules are driven by review content features such as battery performance, screen quality, design attributes, price-value perception, and issue reporting. This confirms that user evaluation behavior is primarily structured around experiential product characteristics rather than static metadata.

Reviewers consistently combine multiple technical and usability dimensions when forming opinions. Screen quality, design, and battery performance appear repeatedly across high-lift and high-confidence rules, indicating stable multi-attribute evaluation clusters.

Moderate influence of pricing structure

The analysis identifies 16 price-related rules, linking price tiers with discussion themes and rating outcomes. While less dominant than feature-driven patterns, pricing still shapes evaluation behavior.

Premium and ultra-priced products show stronger associations with design and screen discussion, reflecting elevated performance expectations. Budget-tier products exhibit more heterogeneous evaluation patterns.

Limited dominance of brand identity

Only 13 brand-related rules appear among the filtered high-quality rule set. Compared to feature and price dimensions, brand effects play a weaker structural role.

This suggests that while brand influences purchasing decisions, review narratives are primarily driven by hands-on experience and functional performance rather than brand identity alone.

Structural summary

Overall, the association rule structure reflects a clear hierarchy:

  1. Product features and performance dominate

  2. Price tier moderates evaluation behavior

  3. Brand identity contributes secondary influence

This pattern aligns with realistic consumer decision-making dynamics in online marketplaces, where satisfaction and dissatisfaction are driven primarily by experiential quality rather than marketing signals.

Business recommendations

The extracted association rules provide actionable insight for product managers, marketing teams, and e-commerce platforms. By analyzing high-confidence and high-lift patterns linked to positive and negative review outcomes, targeted operational and strategic recommendations can be formulated.

cat("\n=== BUSINESS RECOMMENDATIONS ===\n\n")
## 
## === BUSINESS RECOMMENDATIONS ===
# Recommendation 1
if(length(rules_excellent) > 0) {
  cat("1. To get Excellent reviews:\n")
  top_excellent <- head(sort(rules_excellent, by = "lift"), 3)
  inspect(top_excellent)
}
## 1. To get Excellent reviews:
##     lhs                    rhs                   support confidence  coverage     lift count itemset
## [1] {Brand_Samsung,                                                                                 
##      mentions_quality,                                                                              
##      Price_ultra}       => {Rating_excellent} 0.07558109  0.7457449 0.1013498 1.330224  4250      NA
## [2] {mentions_quality,                                                                              
##      Price_ultra}       => {Rating_excellent} 0.12464655  0.7255694 0.1717914 1.294236  7009      NA
## [3] {Brand_Samsung,                                                                                 
##      mentions_quality}  => {Rating_excellent} 0.16487347  0.7151894 0.2305312 1.275721  9271      NA
# Recommendation 2
if(length(rules_poor) > 0) {
  cat("\n2. To avoid poor reviews, watch out for:\n")
  top_poor <- head(sort(rules_poor, by = "lift"), 3)
  inspect(top_poor)
}
## 
## 2. To avoid poor reviews, watch out for:
##     lhs             rhs           support  confidence coverage  lift     count
## [1] {has_issues} => {Rating_poor} 0.077644 0.4684549  0.1657449 1.938599 4366 
##     itemset
## [1] NA
# Recommendation 3
cat("\n3. Most common patterns (by support):\n")
## 
## 3. Most common patterns (by support):
top_support <- head(sort(rules_interesting, by = "support"), 5)
inspect(top_support)
##     lhs                   rhs                support   confidence coverage 
## [1] {Rating_excellent} => {mentions_quality} 0.3724458 0.6643510  0.5606160
## [2] {mentions_quality} => {Rating_excellent} 0.3724458 0.6863857  0.5426188
## [3] {Price_premium}    => {Rating_excellent} 0.2476926 0.5650763  0.4383347
## [4] {Rating_excellent} => {Price_premium}    0.2476926 0.4418221  0.5606160
## [5] {Price_premium}    => {mentions_quality} 0.2400989 0.5477524  0.4383347
##     lift     count itemset
## [1] 1.224342 20943 NA     
## [2] 1.224342 20943 NA     
## [3] 1.007956 13928 NA     
## [4] 1.007956 13928 NA     
## [5] 1.009461 13501 NA

Strategies to increase excellent customer ratings

The strongest rules predicting excellent ratings reveal consistent patterns involving quality mentions, premium price tiers, and brand-feature combinations, such as:

  • {Brand_Samsung, mentions_quality, Price_ultra} → {Rating_excellent}

  • {mentions_quality, Price_ultra} → {Rating_excellent}

  • {Brand_Samsung, mentions_quality} → {Rating_excellent}

These rules exhibit confidence values above 70% and lift values exceeding 1.27, indicating significantly higher-than-random likelihood of excellent ratings.

Recommendation:
Manufacturers and sellers should emphasize perceived quality attributes such as build reliability, display performance, and material finish, particularly for premium and ultra-priced devices. Marketing communication should highlight verified performance benchmarks and quality assurance signals.

Premium pricing strategies should be paired with tangible product differentiation. Customers paying higher prices consistently expect superior performance, and unmet expectations increase dissatisfaction risk.

Risk mitigation to reduce poor reviews

The strongest negative outcome pattern is:

  • {has_issues} → {Rating_poor}

This rule exhibits a lift close to 2, indicating that reported problems nearly double the probability of poor ratings.
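
Since lift(A → B) = confidence(A → B) / P(B), the baseline rate of poor ratings can be backed out from the printed quality measures as a quick sanity check:

# Implied baseline P(Rating_poor) from the {has_issues} => {Rating_poor} rule:
# lift = confidence / P(RHS), so P(RHS) = confidence / lift.
conf_issues <- 0.4684549
lift_issues <- 1.938599
conf_issues / lift_issues   # ~0.242: roughly 24% of reviews are rated poor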

Recommendation:
Organizations should prioritize early-stage defect detection and quality control, particularly in logistics handling, battery reliability, and hardware durability. Automated review monitoring systems can be deployed to flag recurring issue patterns in real time.
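
As a sketch of such a monitoring step, the antecedents of the poor-rating rules can be matched against newly encoded reviews with arules::is.subset(); new_trans below is a placeholder for a transactions object built from incoming reviews using the same item encoding as this analysis:

# Flag reviews whose item profile contains the antecedent of any
# elevated-risk rule (here: poor-rating rules with lift above 1.5).
risk_rules <- subset(rules_poor, lift > 1.5)
hits <- is.subset(lhs(risk_rules), new_trans)  # rules x transactions matrix
at_risk <- colSums(hits) > 0                   # TRUE where any risk rule fires
table(at_risk)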

Post-purchase support processes should also be optimized. Faster warranty handling and responsive customer service can reduce dissatisfaction escalation and review-based reputation damage.

Leveraging high-frequency behavior patterns

High-support rules reveal stable global trends such as:

  • {Rating_excellent} → {mentions_quality}

  • {Price_premium} → {mentions_quality}

  • {Price_premium} → {Rating_excellent}

With support values reaching 37%, these patterns represent dominant population-level behavior rather than niche effects.

Recommendation:
E-commerce platforms can integrate these insights into recommendation engines and review summarization systems. Highlighting quality-related excerpts and feature performance summaries for premium products can strengthen perceived value and improve conversion rates.

Dynamic product badges such as “Highly Rated for Quality” or “Premium Performance Verified” can be algorithmically generated using rule-driven signals to guide consumer decision-making.
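
A minimal sketch of such rule-driven badge generation: select high-confidence rules predicting excellent ratings and surface their antecedent items as badge candidates (the confidence threshold here is illustrative):

# Candidate badge signals: antecedents of strong excellent-rating rules
badge_rules <- subset(rules_interesting,
                      rhs %in% "Rating_excellent" & confidence > 0.70)
badge_items <- unique(unlist(LIST(lhs(badge_rules))))
badge_items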

Strategic summary

Overall, the association-based business insights indicate three core priorities:

  1. Strengthen perceived and actual product quality to maximize excellent reviews
  2. Rapidly detect and mitigate technical issues to prevent negative feedback
  3. Align premium pricing strategies with demonstrable performance advantages

By acting on these priorities, organizations can convert unstructured review data into structured decision intelligence that supports improved customer satisfaction, stronger brand reputation, and sustained sales performance.

Conclusion

The results show that smartphone reviews are primarily structured around feature-based evaluation bundles rather than brand identity alone. The strongest associations form a tightly connected cluster linking screen performance, physical design, and battery behavior. When reviewers discuss one of these attributes, they are substantially more likely than expected to discuss the others, as reflected by the highest lift rules.

Positive quality-related language is strongly and frequently associated with excellent ratings, particularly in higher price tiers where customer expectations are more closely aligned with perceived build quality and performance. In contrast, issue-related language emerges as the clearest signal of dissatisfaction. The rule {has_issues} → {Rating_poor} shows a marked increase in the probability of poor ratings when product problems are reported.

The discovered rule structure indicates that customers evaluate smartphones as an integrated product experience rather than as isolated attributes. Hardware-related themes dominate sentiment outcomes, and multi-attribute evaluations drive both praise and criticism patterns.

These findings provide a data-driven basis for prioritizing product improvement efforts, strengthening quality assurance processes, and deploying early warning systems for negative feedback detection. The results demonstrate the value of association rule mining as a practical and interpretable framework for extracting behavioral structure from large-scale review data.

References

  • Amazon Cell Phones Reviews dataset, Kaggle: https://www.kaggle.com/datasets/grikomsn/amazon-cell-phones-reviews

  • PowerReviews, The Power of Reviews Survey 2021: https://www.powerreviews.com/power-of-reviews-survey-2021/

  • INSEAD Knowledge, How Negative Reviews Affect Online Consumers: https://knowledge.insead.edu/marketing/how-negative-reviews-affect-online-consumers