1. INTRODUCTORY STEPS

We first set up the packages needed for our analysis.

# Set the CRAN mirror:
local({r <- getOption("repos")
r["CRAN"] <- "https://cran.rstudio.com/"
options(repos = r)})

# Install the packages used in this tutorial:
packages <- c("cld2", "quanteda", "seededlda", "sentimentr")

for (i in packages) {
    if (!require(i, character.only = TRUE)) {
        install.packages(i, dependencies = TRUE)
        library(i, character.only = TRUE) # load the package after a fresh install
    }
}
## Loading required package: cld2
## Loading required package: quanteda
## Package version: 4.3.0
## Unicode version: 14.0
## ICU version: 71.1
## Parallel computing: disabled
## See https://quanteda.io for tutorials and examples.
## Loading required package: seededlda
## 
## Attaching package: 'seededlda'
## The following object is masked from 'package:quanteda':
## 
##     info_tbb
## The following object is masked from 'package:stats':
## 
##     terms
## Loading required package: sentimentr

Then, we can load the two data sets: the first concerns the BIC Cristal pen and the second the ReMarkable 2 pen.

bic_cristal_reviews <- read.csv("bic_cristal_reddit2.csv")
remarkable2_pen_reviews <- read.csv("remarkable2_pen_reddit2.csv")
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidytext)
library(quanteda)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(scales)

2. DATA EXPLORATION AND PREPARATION

At this point, we can start to explore, clean and prepare our two data sets. Let's begin by examining their structure, starting with the BIC Cristal data set.

str(bic_cristal_reviews)
## 'data.frame':    991 obs. of  4 variables:
##  $ review_id: chr  "BIC-Cristal-85defa31e5c21f8ac177a3a1225e6562" "BIC-Cristal-de08012735b270c8e66e88439766a426" "BIC-Cristal-de08012735b270c8e66e88439766a426" "BIC-Cristal-244b7c560c061d2b01a68b461841f1d2" ...
##  $ date     : chr  "2024-07-01" "2021-04-06" "2021-04-07" "2024-03-08" ...
##  $ review   : chr  "Bic Wide Body, but without the grip. They also make variants that have a thin grip and a thicker grip. As far a"| __truncated__ "Ah yes, Cristal Re'New, the refillable Cristal. I find it fascinatingly amusing, since the whole point of Crist"| __truncated__ "I remember seeing a silver one a Tiffany years ago but I chew my bics so I didn\031t get it to save my teeth" "Oh okay, that makes much more sense - I think you mean the ones that look like the Bic round stic pens, correct"| __truncated__ ...
##  $ product  : chr  "BIC-Cristal" "BIC-Cristal" "BIC-Cristal" "BIC-Cristal" ...

Then, for the ReMarkable 2 pen.

str(remarkable2_pen_reviews)
## 'data.frame':    738 obs. of  4 variables:
##  $ review_id: chr  "Remarkable-2-f3698a99ab8c3c5174820d281a1d36a8" "Remarkable-2-b7cb401d5ad7cac7e9747bed29dc05c8" "Remarkable-2-b7cb401d5ad7cac7e9747bed29dc05c8" "Remarkable-2-b7cb401d5ad7cac7e9747bed29dc05c8" ...
##  $ date     : chr  "2022-12-10" "2023-02-01" "2024-03-21" "2023-02-03" ...
##  $ review   : chr  "Yes the code is the same. I believe that what happens when the command is executed, is that the script checks w"| __truncated__ "I\031m not familiar with the augmented paper set, but maybe yes, if your pen body is Starwalker Fineliner/Rolle"| __truncated__ "Hello!! So many Google searches later and this is the thread that keeps coming up. I\031m hoping someone out th"| __truncated__ "I have the StarWalker ERM part and that\031s what you would need. Sometimes you can find one on eBay. I happen "| __truncated__ ...
##  $ product  : chr  "Remarkable-2" "Remarkable-2" "Remarkable-2" "Remarkable-2" ...

Now, we need to check if the ‘review_id’ variable can be used as a UID for the different reviews in the data sets.

length(unique(bic_cristal_reviews$review_id))
## [1] 188
length(unique(remarkable2_pen_reviews$review_id))
## [1] 177

As we can observe, the number of unique 'review_id' values (188 and 177, respectively) is far smaller than the number of observations in each data set, so many different reviews share the same 'review_id'. We therefore need to create a new variable that can serve as a UID. Since each row contains a unique document, in the following chunk we use the row names for this purpose. Then, to ensure consistency across the data sets, we convert the UID to a character variable.

bic_cristal_reviews$UID <- row.names(bic_cristal_reviews)
remarkable2_pen_reviews$UID <- row.names(remarkable2_pen_reviews)

# Convert UID to a character variable
bic_cristal_reviews$UID <- as.character(bic_cristal_reviews$UID)
remarkable2_pen_reviews$UID <- as.character(remarkable2_pen_reviews$UID)

Since the 'review_id' variable is not useful for identifying reviews, we can remove it from both data sets. We then move the 'UID' column to the first position.

# Remove 'review_id' column from both datasets
bic_cristal_reviews$review_id <- NULL
remarkable2_pen_reviews$review_id <- NULL

# Reorder columns to move 'UID' to the first position
bic_cristal_reviews <- bic_cristal_reviews[, c("UID", setdiff(names(bic_cristal_reviews), "UID"))]
remarkable2_pen_reviews <- remarkable2_pen_reviews[, c("UID", setdiff(names(remarkable2_pen_reviews), "UID"))]

# Confirm removal and column order
names(bic_cristal_reviews)
## [1] "UID"     "date"    "review"  "product"
names(remarkable2_pen_reviews)
## [1] "UID"     "date"    "review"  "product"

At this point, we remove observations for which the 'review' variable contains a missing value (if there are any).

bic_cristal_reviews <- bic_cristal_reviews[ !is.na(bic_cristal_reviews$review), ]
remarkable2_pen_reviews <- remarkable2_pen_reviews[ !is.na(remarkable2_pen_reviews$review), ]

Moreover, since sentiment-analysis tools are language specific, we verify that all reviews are in English and remove any that are written in other languages.

# For the BIC Cristal
bic_cristal_reviews$language <- cld2::detect_language(bic_cristal_reviews$review)
head(cbind(table(bic_cristal_reviews$language)), 20)
##    [,1]
## en  977
# For the ReMarkable 2 pen
remarkable2_pen_reviews$language <- cld2::detect_language(remarkable2_pen_reviews$review)
head(cbind(table(remarkable2_pen_reviews$language)), 20)
##    [,1]
## en  738

As we can see, every review with a detected language is in English, but in the BIC data set 14 of the 991 reviews cannot be classified (their detected language is NA), probably because they contain elements that disturb the 'cld2' detector (e.g. numeric values, symbols, very short texts); furthermore, false negatives might exist.
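
If we wanted to act on this, a minimal sketch (not applied here) is to keep only the reviews detected as English together with the unclassified ones, which we retain since they are most likely short English texts; the object names below are illustrative.

# Sketch: keep reviews detected as English plus the unclassified (NA) ones
bic_english <- bic_cristal_reviews[ bic_cristal_reviews$language %in% "en" |
                                      is.na(bic_cristal_reviews$language), ]
rm_english  <- remarkable2_pen_reviews[ remarkable2_pen_reviews$language %in% "en" |
                                          is.na(remarkable2_pen_reviews$language), ]
nrow(bic_english)  # equals nrow(bic_cristal_reviews) here, since no other language was detected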

Now, we remove (if there’s any) the characters that are not in the standard ASCII range. This cleans the texts of the reviews and makes them more suitable to be elaborated by a computer. We start by marking those reviews that contain non-ASCII characters, then, if needed, we find-and-replace those characters with ones that are in the ASCII range.

# BIC Cristal
bic_cristal_reviews$non_ascii <- grepl("[^\x01-\x7F]", bic_cristal_reviews$review) # Mark reviews of interest  
bic_cristal_reviews[ which(bic_cristal_reviews$non_ascii %in% TRUE)[30], ]$review
## [1] NA
# ReMarkable 2 pen
remarkable2_pen_reviews$non_ascii <- grepl("[^\x01-\x7F]", remarkable2_pen_reviews$review) # Mark reviews of interest  
remarkable2_pen_reviews[ which(remarkable2_pen_reviews$non_ascii %in% TRUE)[30], ]$review
## [1] NA

Indexing the 30th flagged review returns NA in both data sets, which only tells us that fewer than 30 reviews were flagged, not that none were; a direct count is a more reliable check.
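
The sketch below counts the flags directly; should any reviews be flagged, one possible replacement step is a transliteration with base R's iconv(), shown here only as a comment since we do not apply it.

# Count the flagged reviews directly instead of indexing an arbitrary position
sum(bic_cristal_reviews$non_ascii)
sum(remarkable2_pen_reviews$non_ascii)

# Possible find-and-replace step, if any reviews were flagged (not applied here):
# iconv(bic_cristal_reviews$review, from = "UTF-8", to = "ASCII//TRANSLIT", sub = "")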

3. MODEL PREPARATION

The first step is creating a corpus for each data set, BIC Cristal and ReMarkable 2 pen. A corpus is a collection of texts that we can analyze. Here we build one corpus per data set using the 'quanteda' package, with the 'UID' variable as the document identifier and the 'review' variable as the text field.

#BIC Cristal
corpusbic <- quanteda::corpus(bic_cristal_reviews, 
                           docid_field = "UID",
                           text_field = "review")

summary(corpusbic, 10)
## Corpus consisting of 991 documents, showing 10 documents:
## 
##  Text Types Tokens Sentences       date     product language non_ascii
##     1    69     93         5 2024-07-01 BIC-Cristal       en     FALSE
##     2    22     28         2 2021-04-06 BIC-Cristal       en     FALSE
##     3    21     25         1 2021-04-07 BIC-Cristal       en     FALSE
##     4    35     44         2 2024-03-08 BIC-Cristal       en     FALSE
##     5    55     75         3 2024-05-23 BIC-Cristal       en     FALSE
##     6    24     31         1 2024-05-23 BIC-Cristal       en     FALSE
##     7    29     32         3 2024-06-04 BIC-Cristal       en     FALSE
##     8   104    175         6 2023-02-05 BIC-Cristal       en     FALSE
##     9    72    102         5 2023-02-05 BIC-Cristal       en     FALSE
##    10    51     75         3 2024-08-31 BIC-Cristal       en     FALSE
#Remarkable 2 Pen
corpusRM <- quanteda::corpus(remarkable2_pen_reviews, 
                           docid_field = "UID",
                           text_field = "review")

summary(corpusRM, 10)
## Corpus consisting of 738 documents, showing 10 documents:
## 
##  Text Types Tokens Sentences       date      product language non_ascii
##     1    27     36         2 2022-12-10 Remarkable-2       en     FALSE
##     2    55     78         3 2023-02-01 Remarkable-2       en     FALSE
##     3   100    145         9 2024-03-21 Remarkable-2       en     FALSE
##     4    39     51         4 2023-02-03 Remarkable-2       en     FALSE
##     5    38     48         1 2023-02-03 Remarkable-2       en     FALSE
##     6   108    150         9 2022-01-29 Remarkable-2       en     FALSE
##     7    32     36         2 2020-08-02 Remarkable-2       en     FALSE
##     8    86    121         6 2022-09-03 Remarkable-2       en     FALSE
##     9    31     40         2 2022-09-03 Remarkable-2       en     FALSE
##    10    20     25         4 2022-09-04 Remarkable-2       en     FALSE

To check our corpus, we can randomly draw a review from each corpus and print it.

as.character(corpusbic[ sample(length(corpusbic), 1) ])
##                                                                                                                            387 
## "Fuck Bic their felt tip pens have no colour and their ball points are unreliable have terrible colour and are too sharp even"
as.character(corpusRM[ sample(length(corpusRM), 1) ])
##                                                                    517 
## "Unfortunately I\031m selling the rM2, so I don\031t want to open it."

Now, we can proceed to the sentiment analysis. We will use the 'sentimentr' package to calculate a sentiment score for each review in both corpora; the scores are stored in a new document variable called 'ave_sentiment' in each corpus. Furthermore, a subjectivity score in the range [0, 1] is introduced to measure how subjective each review is, with values close to zero indicating high objectivity and values close to one indicating high subjectivity.
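
Before running it on the full corpora, a minimal example may clarify what sentiment_by() returns: one row per text, with the number of words, the standard deviation of the sentence-level scores, and their averaged sentiment in the ave_sentiment column. The toy reviews below are invented purely for illustration.

# Toy illustration of sentiment_by(); the texts are made-up examples
toy_reviews <- c("I love this pen. It writes smoothly.",
                 "The tip broke after a week and the ink smears everywhere.")
sentimentr::sentiment_by(sentimentr::get_sentences(toy_reviews))
# returns a table with columns: element_id, word_count, sd, ave_sentiment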

library(sentimentr)

# Enhanced sentiment analysis
analyze_sentiment <- function(corpus) {
    sentences <- get_sentences(as.character(corpus))
    sentiment <- sentiment_by(sentences)
    
    # Robust subjectivity calculation with diagnostics
    subjectivity <- lapply(sentences, function(x) {
        tryCatch({
            sentiment_detail <- sentimentr::sentiment(get_sentences(x))
            
            # Handle empty results
            if (nrow(sentiment_detail) == 0) return(0)
            
            # Calculate true subjectivity
            abs_weight_sum <- sum(abs(sentiment_detail$sentiment), na.rm = TRUE)
            word_count <- sum(sentiment_detail$word_count, na.rm = TRUE)
            
            # Prevent division by zero
            if (word_count == 0) return(0)
            
            subjectivity_score <- abs_weight_sum / word_count
            
            # Apply boost for opinion indicators (compare against individual words,
            # not whole sentences, so the check can actually trigger)
            opinion_indicators <- c("think", "believe", "feel", "opinion", "seem", "appear")
            review_words <- tolower(unlist(strsplit(x, "[^[:alpha:]']+")))
            if (any(opinion_indicators %in% review_words)) {
                subjectivity_score <- min(1, subjectivity_score * 1.5)
            }
            
            return(subjectivity_score)
        }, error = function(e) {
            message("Error processing: ", x)
            0  # Return 0 on error
        })
    })
    
    subjectivity <- unlist(subjectivity)
    
    # Apply non-linear scaling to enhance differentiation
    scaled_subjectivity <- subjectivity^0.5  # Square root scaling
    
    return(list(sentiment = sentiment, subjectivity = scaled_subjectivity))
}

# BIC Cristal Sentiment + Subjectivity
bic_analysis <- analyze_sentiment(corpusbic)
docvars(corpusbic, field = "ave_sentiment") <- bic_analysis$sentiment$ave_sentiment
docvars(corpusbic, field = "subjectivity") <- bic_analysis$subjectivity

# Remarkable 2 Pen Sentiment + Subjectivity
rm_analysis <- analyze_sentiment(corpusRM)
docvars(corpusRM, field = "ave_sentiment") <- rm_analysis$sentiment$ave_sentiment
docvars(corpusRM, field = "subjectivity") <- rm_analysis$subjectivity

# Add to original data frames
bic_cristal_reviews$polarity <- bic_analysis$sentiment$ave_sentiment
bic_cristal_reviews$subjectivity <- bic_analysis$subjectivity
remarkable2_pen_reviews$polarity <- rm_analysis$sentiment$ave_sentiment
remarkable2_pen_reviews$subjectivity <- rm_analysis$subjectivity

# Handle any remaining NAs (shouldn't occur, but safeguard)
bic_cristal_reviews$subjectivity[is.na(bic_cristal_reviews$subjectivity)] <- 0
remarkable2_pen_reviews$subjectivity[is.na(remarkable2_pen_reviews$subjectivity)] <- 0

Now we can verify this sentiment analysis; this is done in the following chunk.

# Verify new columns
cat("BIC Cristal columns:", names(bic_cristal_reviews), "\n")
## BIC Cristal columns: UID date review product language non_ascii polarity subjectivity
cat("ReMarkable 2 Pen columns:", names(remarkable2_pen_reviews), "\n")
## ReMarkable 2 Pen columns: UID date review product language non_ascii polarity subjectivity
# Summary of new metrics
cat("\nBIC Cristal Sentiment & Subjectivity:\n")
## 
## BIC Cristal Sentiment & Subjectivity:
print(summary(bic_cristal_reviews[, c("polarity", "subjectivity")]))
##     polarity          subjectivity    
##  Min.   :-0.656284   Min.   :0.00000  
##  1st Qu.: 0.008418   1st Qu.:0.07812  
##  Median : 0.130199   Median :0.11027  
##  Mean   : 0.138470   Mean   :0.11117  
##  3rd Qu.: 0.240808   3rd Qu.:0.14137  
##  Max.   : 0.998008   Max.   :0.34490
cat("\nReMarkable 2 Pen Sentiment & Subjectivity:\n")
## 
## ReMarkable 2 Pen Sentiment & Subjectivity:
print(summary(remarkable2_pen_reviews[, c("polarity", "subjectivity")]))
##     polarity        subjectivity    
##  Min.   :-0.5207   Min.   :0.00000  
##  1st Qu.: 0.0000   1st Qu.:0.08378  
##  Median : 0.1091   Median :0.11110  
##  Mean   : 0.1180   Mean   :0.11365  
##  3rd Qu.: 0.2246   3rd Qu.:0.14387  
##  Max.   : 1.1360   Max.   :0.29446
# Plot distributions
par(mfrow = c(2, 2))
hist(bic_cristal_reviews$polarity, main = "BIC Polarity", xlab = "Sentiment", col = "lightblue")
hist(bic_cristal_reviews$subjectivity, main = "BIC Subjectivity", xlab = "Subjectivity", col = "lightgreen")
hist(remarkable2_pen_reviews$polarity, main = "ReMarkable Polarity", xlab = "Sentiment", col = "lightblue")
hist(remarkable2_pen_reviews$subjectivity, main = "ReMarkable Subjectivity", xlab = "Subjectivity", col = "lightgreen")

par(mfrow = c(1, 1))

As we can see, BIC Cristal reviews are, on average, slightly more objective than ReMarkable pen reviews. We can also check that the length of the sentiment vectors (polarity and subjectivity) matches the number of documents in each corpus.

# Verify sentiment vector lengths
cat("BIC Cristal Sentiment Verification:\n")
## BIC Cristal Sentiment Verification:
cat(" - Corpus documents:", ndoc(corpusbic), "\n")
##  - Corpus documents: 991
cat(" - Sentiment vector in corpus:", length(docvars(corpusbic, "ave_sentiment")), "\n")
##  - Sentiment vector in corpus: 991
cat(" - Polarity in data frame:", length(bic_cristal_reviews$polarity), "\n")
##  - Polarity in data frame: 991
cat("\nReMarkable 2 Pen Sentiment Verification:\n")
## 
## ReMarkable 2 Pen Sentiment Verification:
cat(" - Corpus documents:", ndoc(corpusRM), "\n")
##  - Corpus documents: 738
cat(" - Sentiment vector in corpus:", length(docvars(corpusRM, "ave_sentiment")), "\n")
##  - Sentiment vector in corpus: 738
cat(" - Polarity in data frame:", length(remarkable2_pen_reviews$polarity), "\n")
##  - Polarity in data frame: 738
# Proper equality checks
bic_sent_ok <- length(docvars(corpusbic, "ave_sentiment")) == ndoc(corpusbic)
rm_sent_ok <- length(docvars(corpusRM, "ave_sentiment")) == ndoc(corpusRM)

cat("\nResults:\n")
## 
## Results:
cat("BIC sentiment length matches documents:", bic_sent_ok, "\n")
## BIC sentiment length matches documents: TRUE
cat("ReMarkable sentiment length matches documents:", rm_sent_ok, "\n")
## ReMarkable sentiment length matches documents: TRUE

The results are TRUE, meaning that the sentiment scores have been successfully calculated and stored in the document variables of each corpus.

As we can see, the BIC Cristal polarity histogram is right-skewed, indicating a tendency toward positive sentiment, while the ReMarkable 2 pen distribution is more balanced, with a slight positive tendency as well.

Let’s explore the sentiment scores further by checking the minimum and maximum values, as well as the mean and median.

#BIC Cristal
summary(quanteda::docvars(corpusbic)$ave_sentiment)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.656284  0.008418  0.130199  0.138470  0.240808  0.998008
#Remarkable 2 Pen
summary(quanteda::docvars(corpusRM)$ave_sentiment)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.5207  0.0000  0.1091  0.1180  0.2246  1.1360

The values indicate that the average sentiment is positive for both products. Let's explore the scores with another method, retrieving the reviews whose score coincides with particular summary statistics.

#BIC Cristal
as.character(corpusbic[ which(quanteda::docvars(corpusbic)$ave_sentiment == summary(quanteda::docvars(corpusbic)$ave_sentiment)[3]), ])
##                                                                                                       83 
## "I own this pen. It is the only Montblanc I own. I think the Visconti Homo sapiens ballpoint is better." 
##                                                                                                      379 
##                                            "I've never heard of a Z4, is it new?!?!?! Ahhh!! Bic rules."
as.character(corpusbic[ which(quanteda::docvars(corpusbic)$ave_sentiment == summary(quanteda::docvars(corpusbic)$ave_sentiment)[5]), ])
## named character(0)
#Remarkable 2 Pen
as.character(corpusRM[ which(quanteda::docvars(corpusRM)$ave_sentiment == summary(quanteda::docvars(corpusRM)$ave_sentiment)[1]), ])
##                                                                                                          281 
## "I was concerned a harder stylus would be more likely to scratch the screen. How is your screen holding up?"
as.character(corpusRM[ which(quanteda::docvars(corpusRM)$ave_sentiment == summary(quanteda::docvars(corpusRM)$ave_sentiment)[5]), ])
## named character(0)
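
Note that the empty results (named character(0)) occur because quartiles are interpolated values that may not equal any observed score exactly. A more robust alternative, sketched below, is to index the most extreme reviews directly.

# Most negative and most positive reviews by average sentiment
as.character(corpusbic[ which.min(quanteda::docvars(corpusbic)$ave_sentiment) ])
as.character(corpusbic[ which.max(quanteda::docvars(corpusbic)$ave_sentiment) ])
as.character(corpusRM[ which.min(quanteda::docvars(corpusRM)$ave_sentiment) ])
as.character(corpusRM[ which.max(quanteda::docvars(corpusRM)$ave_sentiment) ])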

Now, we can proceed to the feature extraction step. We will tokenize the text in each corpus, which means breaking it down into individual words or tokens. We will also remove punctuation, symbols, numbers, URLs, and stopwords from the tokens. Then, we will create a document-feature matrix (DFM) for each corpus.

#BIC Cristal
tokens_bic <- quanteda::tokens(corpusbic, 
                           what = "word", # change the keyword
                           remove_punct = TRUE,
                           remove_symbols = TRUE,
                           remove_numbers = TRUE,
                           remove_url = TRUE,
                           remove_separators = TRUE,
                           split_hyphens = FALSE,
                           split_tags = FALSE,
                           include_docvars = TRUE,
                           padding = FALSE,
                           verbose = TRUE)
## Creating a tokens from a corpus object...
##  ...starting tokenization
##  ...tokenizing 1 of 1 blocks
##  ...preserving hyphens
##  ...preserving elisions
##  ...preserving social media tags (#, @)
##  ...removing separators, punctuation, symbols, numbers, URLs
##  ...5,761 unique types
##  ...complete, elapsed time: 0.155 seconds.
## Finished constructing tokens from 991 documents
#Remarkable 2 Pen
tokens_RM <- quanteda::tokens(corpusRM, 
                           what = "word", # change the keyword
                           remove_punct = TRUE,
                           remove_symbols = TRUE,
                           remove_numbers = TRUE,
                           remove_url = TRUE,
                           remove_separators = TRUE,
                           split_hyphens = FALSE,
                           split_tags = FALSE,
                           include_docvars = TRUE,
                           padding = FALSE,
                           verbose = TRUE)
## Creating a tokens from a corpus object...
##  ...starting tokenization
##  ...tokenizing 1 of 1 blocks
##  ...preserving hyphens
##  ...preserving elisions
##  ...preserving social media tags (#, @)
##  ...removing separators, punctuation, symbols, numbers, URLs
##  ...4,217 unique types
##  ...complete, elapsed time: 0.125 seconds.
## Finished constructing tokens from 738 documents

The tokens() function takes a corpus as its input, cleans each document and generates a vector of processed words per document. We can explore the resulting tokens object, for example, by sampling one of its elements:

tokens_bic[ sample(length(tokens_bic), 1) ]
## Tokens consisting of 1 document and 6 docvars.
## 414 :
##  [1] "Best"    "Clicker" "Of"      "All"     "Time"    "Prove"   "me"     
##  [8] "wrong"   "I"       "am"      "holding" "out"    
## [ ... and 14 more ]
tokens_RM[ sample(length(tokens_RM), 1) ]
## Tokens consisting of 1 document and 6 docvars.
## 446 :
##  [1] "My"     "marker" "fall"   "and"    "the"    "cone"   "around" "the"   
##  [9] "tip"    "was"    "broken" "too"   
## [ ... and 19 more ]

We now have, for each review, a vector containing its extracted and preprocessed words. Next, we can explore the tokens by checking certain keywords in context: for example, the word “nice” in the BIC Cristal corpus and the word “good” in the ReMarkable 2 pen corpus. We also convert the tokens to lowercase to ensure consistency in the analysis.

#BIC Cristal
data.frame(quanteda::kwic(tokens_bic, pattern = "nice"))[ c(1:5), c(4:6) ] #Change the word nice
##                             pre keyword                          post
## 1                          Very    nice    modification M I'm glad to
## 2        experience I do like a    nice     pen with the right refill
## 3            it pen into a very    nice    writer by changing the ink
## 4     happy if someone says ooh    nice     pen and pretty happy even
## 5 investment but just like with    nice watches you'll always be able
tokens_bic <- quanteda::tokens_tolower(tokens_bic)
#Remarkable 2 Pen
data.frame(quanteda::kwic(tokens_RM, pattern = "good"))[ c(1:5), c(4:6) ] #Change the word good
##                                       pre keyword
## 1                i think these are really    good
## 2                   the long run But it's    good
## 3                   best of both worlds a    good
## 4 Samsung's stylus-enabled line is pretty    good
## 5                    some of them will be    good
##                             post
## 1   to write with nice stainless
## 2       to know that this exists
## 3 pen and the writing experience
## 4        You won't get e-ink but
## 5                and I can get a
tokens_RM <- quanteda::tokens_tolower(tokens_RM)

Now, we can remove stopwords from the tokens. Stopwords are common words that do not carry much meaning, such as “the”, “and”, “is”, etc. We will use the English stopwords list provided by the quanteda package.

#BIC Cristal
quanteda::stopwords("en")
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"       "will"
tokens_bic <- quanteda::tokens_remove(tokens_bic, pattern = stopwords("en"))
#Remarkable 2 Pen
quanteda::stopwords("en")
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"       "will"
tokens_RM <- quanteda::tokens_remove(tokens_RM, pattern = stopwords("en"))

We have removed the stopwords from the tokens. Now, before creating a DFM for each corpus, we can apply lemmatization, which reduces each word to its base dictionary form (its lemma) so that, for example, “ran” and “running” are both counted as “run”. To do so, we first install the required packages.

# Install additional required packages
packages <- c(packages, "textstem", "lexicon")  # Add new packages
for (i in packages) {
    if (!require(i, character.only = TRUE)) {
        install.packages(i, dependencies = TRUE)
        library(i, character.only = TRUE) # load the package after a fresh install
    }
}
## Loading required package: textstem
## Loading required package: koRpus.lang.en
## Loading required package: koRpus
## Loading required package: sylly
## For information on available language packages for 'koRpus', run
## 
##   available.koRpus.lang()
## 
## and see ?install.koRpus.lang()
## 
## Attaching package: 'koRpus'
## The following objects are masked from 'package:quanteda':
## 
##     tokens, types
## Loading required package: lexicon
## 
## Attaching package: 'lexicon'
## The following object is masked from 'package:sentimentr':
## 
##     available_data
library(textstem)

Now we can actually perform the lemmatization.

# Function to lemmatize tokens using a dictionary lookup (lexicon::hash_lemmas)
lemmatize_tokens <- function(tokens) {
    # Convert tokens to list for processing
    token_list <- as.list(tokens)
    
    # Lemmatize each document
    lemmatized <- lapply(token_list, function(x) {
        if (length(x) == 0) return(x)  # Handle empty documents
        lemmas <- textstem::lemmatize_words(x, dictionary = lexicon::hash_lemmas)
        return(lemmas)
    })
    
    # Convert back to tokens object
    new_tokens <- quanteda::as.tokens(lemmatized)
    quanteda::docvars(new_tokens) <- quanteda::docvars(tokens)
    return(new_tokens)
}

# Create copies of tokens before lemmatization
tokens_bic_orig <- tokens_bic
tokens_RM_orig <- tokens_RM

# Apply lemmatization to both token sets
tokens_bic <- lemmatize_tokens(tokens_bic)
tokens_RM <- lemmatize_tokens(tokens_RM)

Now we add a verification step to check results.

# Check lemmatization package functionality
cat("\nLEMMATIZATION PACKAGE CHECK:\n")
## 
## LEMMATIZATION PACKAGE CHECK:
test_words <- c("running", "ran", "runs", "better", "best", "is", "was")
test_lemmas <- textstem::lemmatize_words(test_words, dictionary = lexicon::hash_lemmas)
data.frame(Word = test_words, Lemma = test_lemmas)
##      Word Lemma
## 1 running   run
## 2     ran   run
## 3    runs   run
## 4  better  good
## 5    best  good
## 6      is    be
## 7     was    be

It works as desired. Now we create two new data sets, one for each product, containing the cleaned reviews resulting from the previous steps; each data set has six columns: UID, Date, Old_Comment, Cleaned_Comment, Polarity score, and Subjectivity score.

library(dplyr)
library(tibble)

# Reconstruct cleaned (lemmatized) sentences from token lists
cleaned_bic <- sapply(tokens_bic, paste, collapse = " ")
cleaned_rm  <- sapply(tokens_RM,  paste, collapse = " ")

# Create analysis data set for BIC Cristal
bic_reviews_cleaned <- tibble(
  UID = paste0("BIC_", seq_along(cleaned_bic)),
  Date = bic_cristal_reviews$date,
  Old_Comment = bic_cristal_reviews$review,
  Cleaned_Comment = cleaned_bic,
  `Polarity score` = bic_cristal_reviews$polarity,
  `Subjectivity score` = bic_cristal_reviews$subjectivity
)

# Create analysis data set for ReMarkable 2 Pen
rm_reviews_cleaned <- tibble(
  UID = paste0("RM_", seq_along(cleaned_rm)),
  Date = remarkable2_pen_reviews$date,
  Old_Comment = remarkable2_pen_reviews$review,
  Cleaned_Comment = cleaned_rm,
  `Polarity score` = remarkable2_pen_reviews$polarity,
  `Subjectivity score` = remarkable2_pen_reviews$subjectivity
)

# Save as csv
write.csv(bic_reviews_cleaned, "bic_cristal_cleaned_data.csv", row.names = FALSE)
write.csv(rm_reviews_cleaned, "remarkable_pen_cleaned_data.csv", row.names = FALSE)

Let’s look at the first observations of these new data sets.

head(bic_reviews_cleaned)
## # A tibble: 6 × 6
##   UID   Date   Old_Comment Cleaned_Comment `Polarity score` `Subjectivity score`
##   <chr> <chr>  <chr>       <chr>                      <dbl>                <dbl>
## 1 BIC_1 2024-… "Bic Wide … bic wide body …          -0.129                0.0703
## 2 BIC_2 2021-… "Ah yes, C… ah yes cristal…           0.289                0.158 
## 3 BIC_3 2021-… "I remembe… remember see s…           0.1                  0.0632
## 4 BIC_4 2024-… "Oh okay, … oh okay make m…           0.427                0.152 
## 5 BIC_5 2024-… "I cut a B… cut bic crista…           0.117                0.0908
## 6 BIC_6 2024-… "You can m… can make bic r…           0.0743               0.0506
head(rm_reviews_cleaned)
## # A tibble: 6 × 6
##   UID   Date   Old_Comment Cleaned_Comment `Polarity score` `Subjectivity score`
##   <chr> <chr>  <chr>       <chr>                      <dbl>                <dbl>
## 1 RM_1  2022-… "Yes the c… yes code belie…           0.178                0.0995
## 2 RM_2  2023-… "I\u0019m … be familiar au…           0.171                0.117 
## 3 RM_3  2024-… "Hello!! S… hello many goo…           0.187                0.143 
## 4 RM_4  2023-… "I have th… starwalker erm…           0.0715               0.0979
## 5 RM_5  2023-… "The StarW… starwalker com…           0.0564               0.0479
## 6 RM_6  2022-… "&gt;I had… gt pebble watc…          -0.141                0.109

Now we can proceed with the document-feature matrix creation. A document-feature matrix (DFM) is a matrix that represents the frequency of each word (feature) in each document. We will use the quanteda package to create the DFM.

#BIC Cristal
dfm_bic <- quanteda::dfm(tokens_bic)
dfm_bic
## Document-feature matrix of: 991 documents, 3,957 features (99.42% sparse) and 6 docvars.
##     features
## docs bic wide body without grip also make variant thin thick
##    1   2    1    1       1    3    1    2       1    1     1
##    2   0    0    0       0    0    0    0       0    0     0
##    3   0    0    0       0    0    0    0       0    0     0
##    4   1    0    0       0    0    0    1       0    0     0
##    5   1    0    0       0    0    0    0       0    0     0
##    6   1    0    1       0    2    0    1       0    0     0
## [ reached max_ndoc ... 985 more documents, reached max_nfeat ... 3,947 more features ]
topfeatures(dfm_bic)
##       pen       bic    refill ballpoint       use       ink      good   cristal 
##      1160       886       402       368       367       355       333       313 
##      like       one 
##       307       281
#Remarkable 2 Pen
dfm_RM <- quanteda::dfm(tokens_RM)
dfm_RM
## Document-feature matrix of: 738 documents, 2,767 features (99.23% sparse) and 6 docvars.
##     features
## docs yes code believe happen command execute script check firmware apply
##    1   1    1       1      1       1       1      1     1        2     1
##    2   1    0       0      0       0       0      0     0        0     0
##    3   0    0       0      0       0       0      0     0        0     0
##    4   0    0       0      2       0       0      0     0        0     0
##    5   0    0       0      0       0       0      0     0        0     0
##    6   1    0       0      1       0       0      0     0        0     0
## [ reached max_ndoc ... 732 more documents, reached max_nfeat ... 2,757 more features ]
topfeatures(dfm_RM)
##        pen        use     stylus remarkable       good       work        nib 
##        656        316        248        231        215        211        210 
##       like        one        get 
##        201        186        178

Next we will perform some additional cleaning, but first we check the length of the tokens in each corpus. This helps us identify tokens that are too short, which might not be useful for our analysis.

#BIC Cristal
length_features_bic <- data.frame(count = quanteda::topfeatures(dfm_bic, ncol(dfm_bic))) # create data frame 
length_features_bic$length <- nchar(rownames(length_features_bic)) # extract token length
length_features_bic[ which(length_features_bic$length %in% c(1, 2)), ] # filter on token length
##    count length
## s    137      1
## go   126      2
## t     89      1
## be    63      2
## g2    50      2
## re    42      2
## us    32      2
## gt    30      2
## sr    28      2
## mm    21      2
## do    17      2
## mb    16      2
## z4    13      2
## r     12      1
## se    12      2
## th    10      2
## vs     9      2
## f      9      1
## uk     9      2
## u      9      1
## g      9      1
## v5     7      2
## em     7      2
## o      6      1
## oh     6      2
## b      6      1
## c      6      1
## ah     5      2
## og     5      2
## 2f     5      2
## op     5      2
## 2      5      1
## rt     5      2
## d1     4      2
## fp     4      2
## x      4      1
## w      4      1
## r2     4      2
## im     4      2
## 3d     4      2
## xl     4      2
## bp     4      2
## id     3      2
## dr     3      2
## z      3      1
## ti     3      2
## q      3      1
## pm     3      2
## eu     3      2
## up     3      2
## #7     3      2
## et     2      2
## no     2      2
## eh     2      2
## 3a     2      2
## h      2      1
## ai     2      2
## _2     2      2
## tt     2      2
## 5d     2      2
## 0a     2      2
## nz     2      2
## ez     2      2
## p      2      1
## ha     2      2
## ty     2      2
## xb     2      2
## 2c     2      2
## gp     2      2
## lt     1      2
## ms     1      2
## 5k     1      2
## hl     1      2
## sh     1      2
## kg     1      2
## ac     1      2
## qc     1      2
## cv     1      2
## ea     1      2
## bc     1      2
## um     1      2
## pl     1      2
## _t     1      2
## oz     1      2
## 2x     1      2
## k      1      1
## dq     1      2
## pg     1      2
## #v     1      2
## au     1      2
## xf     1      2
## de     1      2
## yr     1      2
## nc     1      2
## a1     1      2
## ct     1      2
## xo     1      2
## tl     1      2
## l      1      1
## eq     1      2
## n2     1      2
## le     1      2
## e      1      1
## á      1      1
## yt     1      2
## v      1      1
## _1     1      2
## dx     1      2
## co     1      2
## st     1      2
## sc     1      2
## v7     1      2
## ûd     1      2
## 5      1      1
## lo     1      2
## ü      1      1
## 2k     1      2
## 6k     1      2
## bu     1      2
## ii     1      2
## ef     1      2
## ps     1      2
## ð      1      1
## pp     1      2
#Remarkable 2 Pen
length_features_RM <- data.frame(count = quanteda::topfeatures(dfm_RM, ncol(dfm_RM))) # create data frame
length_features_RM$length <- nchar(rownames(length_features_RM)) # extract token length
length_features_RM[ which(length_features_RM$length %in% c(1, 2)), ] # filter on token length
##    count length
## s    117      1
## t     96      1
## rm    94      2
## go    63      2
## be    50      2
## th    27      2
## gt    25      2
## do    16      2
## s6    14      2
## u     12      1
## w     11      1
## re    10      2
## us    10      2
## aa    10      2
## 3      9      1
## x      9      1
## oh     7      2
## 2      6      1
## o      6      1
## 2c     6      2
## c      5      1
## ai     5      2
## 3d     5      2
## pp     5      2
## sr     4      2
## os     4      2
## lt     4      2
## s7     4      2
## y      4      1
## vs     3      2
## l      3      1
## eg     3      2
## al     3      2
## im     3      2
## sh     3      2
## id     3      2
## sc     3      2
## cp     3      2
## sp     3      2
## qp     3      2
## n      3      1
## af     3      2
## k      3      1
## op     2      2
## _1     2      2
## b      2      1
## e      2      1
## ha     2      2
## no     2      2
## 4      2      1
## f      2      1
## a5     2      2
## r      2      1
## z      2      1
## a4     1      2
## a6     1      2
## _b     1      2
## ie     1      2
## eu     1      2
## _8     1      2
## ip     1      2
## sw     1      2
## nv     1      2
## bb     1      2
## r2     1      2
## dr     1      2
## #3     1      2
## ps     1      2
## g      1      1
## ý      1      1
## fr     1      2
## co     1      2
## bc     1      2
## _3     1      2
## up     1      2
## ex     1      2
## hp     1      2
## ui     1      2
## m2     1      2
## ad     1      2
## de     1      2
## bw     1      2
## e3     1      2
## xl     1      2
## og     1      2
## mm     1      2
## 5s     1      2
## ik     1      2
## yr     1      2
## ii     1      2

Looking at the results above for both BIC Cristal and ReMarkable 2 Pen, most of these one- and two-character features are rare; the few more frequent ones (e.g. “s”, “t”, “gt”, “amp”) are contraction fragments and HTML-entity residue, which we deal with when refining the topic models below. For now, we can proceed to the next step.
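
If we preferred to drop these short features wholesale rather than curate them later, quanteda's dfm_select() accepts a minimum feature length; the sketch below is not applied here, since the relevant terms are removed individually in the model-improvement section.

# Sketch: keep only features with at least 3 characters (not applied here)
dfm_bic_min3 <- quanteda::dfm_select(dfm_bic, min_nchar = 3)
dfm_RM_min3  <- quanteda::dfm_select(dfm_RM, min_nchar = 3)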

Now we create a word cloud for each product.

#BIC Cristal
# Create a term frequency table
freq <- colSums(dfm_bic)
freq <- sort(freq, decreasing = TRUE)

# Basic word cloud
wordcloud(words = names(freq), freq = freq, min.freq = 5,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))

#Remarkable 2 Pen
# Create a term frequency table
freq <- colSums(dfm_RM)
freq <- sort(freq, decreasing = TRUE)

# Basic word cloud
wordcloud(words = names(freq), freq = freq, min.freq = 5,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))

4. TOPIC MODELING

We can now proceed to the topic modeling step. Topic modeling is a technique used to discover the underlying topics in a collection of documents. We will use the seededlda package to perform topic modeling on both corpora.
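
One caveat: LDA is estimated by Gibbs sampling, so the topic assignments and top terms reported below can vary slightly between runs. Setting a seed beforehand should make the fits reproducible, assuming the sampler draws from R's random number generator.

set.seed(1234)  # arbitrary seed, for reproducibility of the stochastic LDA fits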

# BIC Cristal Topic Modeling
textmodel_positive_bic <- seededlda::textmodel_lda(dfm_bic[ docvars(dfm_bic)$ave_sentiment >= 0, ], k = 2)
textmodel_negative_bic <- seededlda::textmodel_lda(dfm_bic[ docvars(dfm_bic)$ave_sentiment < 0, ], k = 2)
# ReMarkable 2 Pen Topic Modeling
textmodel_positive_rm <- seededlda::textmodel_lda(dfm_RM[ docvars(dfm_RM)$ave_sentiment >= 0, ], k = 2)
textmodel_negative_rm <- seededlda::textmodel_lda(dfm_RM[ docvars(dfm_RM)$ave_sentiment < 0, ], k = 2)

5. INTERPRETATION AND EVALUATION

We can now interpret the results of the topic modeling. We first check how documents are assigned to topics in each corpus, and then look at the top terms per topic.

#BIC Cristal
head(topics(textmodel_positive_bic))
##      2      3      4      5      6      7 
## topic1 topic1 topic1 topic1 topic1 topic1 
## Levels: topic1 topic2
head(topics(textmodel_negative_bic))
##      1     22     29     31     32     37 
## topic1 topic1 topic1 topic1 topic1 topic1 
## Levels: topic1 topic2
table(topics(textmodel_positive_bic))
## 
## topic1 topic2 
##    621    200
table(topics(textmodel_negative_bic))
## 
## topic1 topic2 
##    141     29
#Remarkable 2 Pen
head(topics(textmodel_positive_rm))
##      1      2      3      4      5      8 
## topic2 topic1 topic1 topic1 topic1 topic1 
## Levels: topic1 topic2
head(topics(textmodel_negative_rm))
##      6      7     12     19     23     26 
## topic1 topic2 topic1 topic1 topic2 topic1 
## Levels: topic1 topic2
table(topics(textmodel_positive_rm))
## 
## topic1 topic2 
##    373    199
table(topics(textmodel_negative_rm))
## 
## topic1 topic2 
##    134     32

Based on the tables above, for BIC Cristal most documents are assigned to topic 1 for both the positive- and negative-sentiment models, and the same is true for the ReMarkable 2 pen. This indicates that the topics are not evenly distributed across the documents.
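
topics() only reports the single most likely topic per document. Assuming the standard seededlda object structure, the fitted models also expose the full document-topic proportion matrix (theta) and the topic-word probability matrix (phi), which give a more nuanced view of how mixed each document is.

# Document-topic proportions for the first positive BIC Cristal documents
head(textmodel_positive_bic$theta)
# Topic-word probabilities: one row per topic, one column per feature
dim(textmodel_positive_bic$phi)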

#BIC Cristal
cbind(seededlda::terms(textmodel_positive_bic, 10), 
      seededlda::terms(textmodel_negative_bic, 10))
##       topic1    topic2      topic1      topic2     
##  [1,] "pen"     "ballpoint" "pen"       "pen"      
##  [2,] "bic"     "pen"       "bic"       "bic"      
##  [3,] "cristal" "amp"       "use"       "jetstream"
##  [4,] "use"     "refill"    "cristal"   "gel"      
##  [5,] "one"     "ink"       "ink"       "amp"      
##  [6,] "good"    "gel"       "refill"    "s"        
##  [7,] "like"    "good"      "one"       "ballpoint"
##  [8,] "just"    "can"       "just"      "ink"      
##  [9,] "write"   "like"      "ballpoint" "try"      
## [10,] "ink"     "write"     "get"       "energel"
#Remarkable 2 Pen
cbind(seededlda::terms(textmodel_positive_rm, 10), 
      seededlda::terms(textmodel_negative_rm, 10))
##       topic1   topic2       topic1       topic2    
##  [1,] "pen"    "pen"        "pen"        "stylus"  
##  [2,] "like"   "amp"        "use"        "review"  
##  [3,] "use"    "work"       "get"        "also"    
##  [4,] "feel"   "remarkable" "nib"        "noris"   
##  [5,] "write"  "good"       "stylus"     "use"     
##  [6,] "tip"    "lamy"       "just"       "emr"     
##  [7,] "stylus" "eraser"     "work"       "go"      
##  [8,] "one"    "use"        "like"       "gt"      
##  [9,] "nib"    "stylus"     "remarkable" "much"    
## [10,] "can"    "marker"     "try"        "fakespot"
# Get topic distribution for each document
topic_bic_pos <- data.frame(
  topic = topics(textmodel_positive_bic),
  sentiment = "Positive",
  brand = "BIC Cristal"
)

topic_bic_neg <- data.frame(
  topic = topics(textmodel_negative_bic),
  sentiment = "Negative",
  brand = "BIC Cristal"
)

topic_rm_pos <- data.frame(
  topic = topics(textmodel_positive_rm),
  sentiment = "Positive",
  brand = "Remarkable 2"
)

topic_rm_neg <- data.frame(
  topic = topics(textmodel_negative_rm),
  sentiment = "Negative",
  brand = "Remarkable 2"
)
# Combine all topic data
topic_all <- bind_rows(topic_bic_pos, topic_bic_neg, topic_rm_pos, topic_rm_neg)
topic_dist <- topic_all %>%
  group_by(brand, sentiment, topic) %>%
  summarise(doc_count = n(), .groups = "drop")
#plot
ggplot(topic_dist, aes(x = factor(topic), y = doc_count, fill = sentiment)) +
  geom_col(position = "dodge") +
  facet_wrap(~ brand) +
  labs(title = "Topic Distribution per Brand",
       x = "Topic", y = "Number of Documents") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set1")

The first two columns come from the positive-sentiment models, while the last two come from the negative-sentiment ones. As we can see, the BIC Cristal topics appear to revolve around the pen itself, its inks and refills, and the writing experience, while the ReMarkable 2 pen topics appear to revolve around the stylus, nibs and overall usability. However, we notice words that are not very meaningful for understanding the customer reviews, such as “pen”, “bic”, “cristal”, etc. We will try to improve the model performance.
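
In the next section we handle these terms with an explicit removal list (dfm_remove()). An alternative worth noting is to filter by document frequency with dfm_trim(), which drops any term appearing in more than a chosen share of documents; the 30% threshold below is purely illustrative and is not applied to our models.

# Sketch: automatically drop terms that occur in more than 30% of documents
dfm_bic_docfreq <- quanteda::dfm_trim(dfm_bic, max_docfreq = 0.30, docfreq_type = "prop")
dfm_RM_docfreq  <- quanteda::dfm_trim(dfm_RM, max_docfreq = 0.30, docfreq_type = "prop")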

6. MODEL IMPROVEMENT

First, we remove short reviews.

# BIC Cristal
## Remove documents that contain fewer than 2 tokens (a threshold of 5 removed everything)
tokens_bic_new <- tokens_bic[ quanteda::ntoken(tokens_bic) >= 2 ]

## Prepare the new dfm (this integrates some of the previous steps)
dfm_bic <- quanteda::dfm(tokens_bic_new)
#dfm_bic <- dfm_trim(dfm, min_termfreq = 3)
#dfm_bic <- dfm[ -which((rowSums(dfm)) %in% 0), ]
## Run the new textmodels
textmodel_positive_bic_new <- seededlda::textmodel_lda(dfm_bic[ docvars(dfm_bic)$ave_sentiment >= 0, ], k = 2)
textmodel_negative_bic_new <- seededlda::textmodel_lda(dfm_bic[ docvars(dfm_bic)$ave_sentiment < 0, ], k = 2)

## Explore the topics
cbind(seededlda::terms(textmodel_positive_bic_new), 
      seededlda::terms(textmodel_negative_bic_new))
##       topic1      topic2    topic1    topic2     
##  [1,] "pen"       "pen"     "bic"     "pen"      
##  [2,] "refill"    "bic"     "cristal" "ink"      
##  [3,] "ballpoint" "cristal" "pen"     "ballpoint"
##  [4,] "amp"       "one"     "make"    "refill"   
##  [5,] "ink"       "use"     "color"   "bic"      
##  [6,] "good"      "like"    "use"     "use"      
##  [7,] "gel"       "good"    "get"     "like"     
##  [8,] "write"     "just"    "stic"    "one"      
##  [9,] "can"       "write"   "good"    "can"      
## [10,] "use"       "make"    "round"   "just"
# ReMarkable 2 Pen
## Remove documents that contain less than 2 words
tokens_RM_new <- tokens_RM[ quanteda::ntoken(tokens_RM) >= 2 ]
## Prepare the new dfm (this integrates some of the previous steps)
dfm_RM <- quanteda::dfm(tokens_RM_new)
#dfm_RM <- dfm_trim(dfm_RM, min_termfreq = 10)
#dfm_RM <- dfm_RM[ -which((rowSums(dfm_RM)) %in% 0), ]
## Run the new textmodels
textmodel_positive_RM_new <- seededlda::textmodel_lda(dfm_RM[ docvars(dfm_RM)$ave_sentiment >= 0, ], k = 2)
textmodel_negative_RM_new <- seededlda::textmodel_lda(dfm_RM[ docvars(dfm_RM)$ave_sentiment < 0, ], k = 2)

## Explore the topics
cbind(seededlda::terms(textmodel_positive_RM_new), 
      seededlda::terms(textmodel_negative_RM_new))
##       topic1       topic2       topic1       topic2    
##  [1,] "pen"        "amp"        "pen"        "stylus"  
##  [2,] "use"        "pen"        "use"        "review"  
##  [3,] "like"       "remarkable" "get"        "also"    
##  [4,] "nib"        "work"       "nib"        "noris"   
##  [5,] "one"        "just"       "just"       "use"     
##  [6,] "feel"       "case"       "stylus"     "emr"     
##  [7,] "good"       "stylus"     "work"       "go"      
##  [8,] "stylus"     "good"       "like"       "gt"      
##  [9,] "remarkable" "buy"        "remarkable" "much"    
## [10,] "write"      "hack"       "tip"        "fakespot"

Next, we try adjusting the models, starting with BIC Cristal: we keep the number of topics at k = 2 but change the estimation parameters of the LDA model (more iterations and different alpha and beta priors), and then check whether the resulting topics make sense and are interpretable.

BIC Cristal

##BIC Cristal
textmodel_positive_bic_new <- seededlda::textmodel_lda(dfm_bic[ docvars(dfm_bic)$ave_sentiment >= 0, ],
                                                   k = 2,
                                                   max_iter = 1000,
                                                   alpha = 0.5,
                                                   beta = 0.05) 
textmodel_negative_bic_new <- seededlda::textmodel_lda(dfm_bic[ docvars(dfm_bic)$ave_sentiment < 0, ],
                                                   k = 2,
                                                   max_iter = 1000,
                                                   alpha = 0.5,
                                                   beta = 0.05)
cbind(seededlda::terms(textmodel_positive_bic_new), 
      seededlda::terms(textmodel_negative_bic_new))
##       topic1    topic2      topic1    topic2     
##  [1,] "pen"     "ballpoint" "bic"     "pen"      
##  [2,] "bic"     "refill"    "pen"     "ballpoint"
##  [3,] "cristal" "amp"       "use"     "ink"      
##  [4,] "one"     "pen"       "cristal" "bic"      
##  [5,] "use"     "gel"       "refill"  "write"    
##  [6,] "like"    "ink"       "one"     "find"     
##  [7,] "good"    "good"      "just"    "go"       
##  [8,] "just"    "write"     "get"     "jetstream"
##  [9,] "ink"     "can"       "like"    "old"      
## [10,] "write"   "use"       "ink"     "gel"

Based on this adjustment, there are no significant changes in the topics, so we can proceed to the next step. Notice that there are still words that are not relevant for understanding the customer reviews, such as “pen”, “bic”, “cristal”, etc. We will improve the model by removing those words.
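
Before removing them, a quick way to quantify the claim that the adjustment changed little is seededlda's divergence() score, which measures how well separated the estimated topics are (higher values indicate more distinct topics); this is only a sketch and assumes a reasonably recent seededlda version.

# Compare topic separation of the initial and the adjusted positive-sentiment models
seededlda::divergence(textmodel_positive_bic)
seededlda::divergence(textmodel_positive_bic_new)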

dfm_tmp_bic <- quanteda::dfm_remove(dfm_bic, pattern = c("pen", "also", "pens",
                                                 "just", "ballpoints","find", "Cristal","ballpoint","use",
                                                 "write","writing","bic","s","want","can"  ))

textmodel_positive_bic_new <- seededlda::textmodel_lda(dfm_tmp_bic[ docvars(dfm_tmp_bic)$ave_sentiment >= 0, ], k = 2)
textmodel_negative_bic_new <- seededlda::textmodel_lda(dfm_tmp_bic[ docvars(dfm_tmp_bic)$ave_sentiment < 0, ], k = 2)
cbind(terms(textmodel_positive_bic_new), terms(textmodel_negative_bic_new))
##       topic1  topic2   topic1      topic2  
##  [1,] "amp"   "refill" "jetstream" "ink"   
##  [2,] "buy"   "good"   "gel"       "one"   
##  [3,] "see"   "like"   "refill"    "like"  
##  [4,] "ink"   "ink"    "amp"       "get"   
##  [5,] "make"  "one"    "line"      "refill"
##  [6,] "one"   "gel"    "energel"   "make"  
##  [7,] "color" "smooth" "uni"       "color" 
##  [8,] "sell"  "get"    "need"      "good"  
##  [9,] "price" "love"   "always"    "buy"   
## [10,] "go"    "try"    "lot"       "know"
# Extract top words per topic
positive_terms_bic <- terms(textmodel_positive_bic_new, 15)
negative_terms_bic <- terms(textmodel_negative_bic_new, 15)

# Convert matrices into data frames
positive_df_bic <- data.frame(
  word = as.vector(positive_terms_bic),
  topic = rep(1:ncol(positive_terms_bic), each = nrow(positive_terms_bic)),
  sentiment = "Positive"
)

negative_df_bic <- data.frame(
  word = as.vector(negative_terms_bic),
  topic = rep(1:ncol(negative_terms_bic), each = nrow(negative_terms_bic)),
  sentiment = "Negative"
)

# Combine positive & negative topics
topics_combined_bic <- bind_rows(positive_df_bic, negative_df_bic)

# Count occurrences of words per topic
word_counts_bic <- topics_combined_bic %>%
  group_by(topic, word, sentiment) %>%
  summarise(count = n(), .groups = "drop")

# Plot: sorted within each facet, to understand word and sentiment mapping
ggplot(word_counts_bic, aes(x = reorder_within(word, count, topic), y = count, fill = sentiment)) +
  geom_col(show.legend = TRUE) +
  facet_wrap(~ topic + sentiment, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +  # Needed to handle reordered_within
  labs(title = "Top Words per Topic and Sentiment",
       x = "Word", y = "Sentiment") +
    theme_minimal()

ggsave("top_words_per_topic_and_sentiment.png", width = 10, height = 42, limitsize = FALSE)
# Create a term frequency table
freq <- colSums(dfm_tmp_bic)
freq <- sort(freq, decreasing = TRUE)

# Basic word cloud
wordcloud(words = names(freq), freq = freq, min.freq = 5,
          max.words = 50, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"),
          scale = c(4, 0.8))


Now, we can check the ReMarkable 2 pen model: again we keep k = 2 but adjust the estimation parameters of the LDA model, and check whether the topics make sense and are interpretable.

Remarkable 2 Pen

##Remarkable 2 Pen
textmodel_positive_RM_new <- seededlda::textmodel_lda(dfm_RM[ docvars(dfm_RM)$ave_sentiment >= 0, ],
                                                   k = 2,
                                                   max_iter = 1000,
                                                   alpha = 0.5,
                                                   beta = 0.05) 
textmodel_negative_RM_new <- seededlda::textmodel_lda(dfm_RM[ docvars(dfm_RM)$ave_sentiment < 0, ],
                                                   k = 2,
                                                   max_iter = 1000,
                                                   alpha = 0.5,
                                                   beta = 0.05)
cbind(seededlda::terms(textmodel_positive_RM_new), 
      seededlda::terms(textmodel_negative_RM_new))
##       topic1       topic2       topic1       topic2  
##  [1,] "pen"        "amp"        "pen"        "stylus"
##  [2,] "use"        "remarkable" "get"        "use"   
##  [3,] "like"       "work"       "nib"        "review"
##  [4,] "nib"        "pen"        "use"        "noris" 
##  [5,] "good"       "case"       "work"       "good"  
##  [6,] "feel"       "just"       "like"       "also"  
##  [7,] "one"        "stylus"     "remarkable" "emr"   
##  [8,] "remarkable" "want"       "tip"        "gt"    
##  [9,] "stylus"     "good"       "stylus"     "want"  
## [10,] "write"      "note"       "rm2"        "happen"

The adjusted parameters do not change the topics in any significant way, so we can proceed to the next step. Notice that there are still words, such as pen, pens and remarkable, that do not help us understand the customer reviews. We will improve the model by removing those words.

dfm_tmp_RM <- quanteda::dfm_remove(dfm_RM, pattern = c("pen", "also", "pens",
                                                 "just", "much", "find", "lamy", "rm", "rm2",
                                                 "see", "really", "writing", "s", "want", "can",
                                                 "pencil", "stylus", "nib", "nibs", "back", "think",
                                                 "rm1", "amazon", "kindle", "t", "amp", "emr"))

textmodel_positive_RM_new <- seededlda::textmodel_lda(dfm_tmp_RM[ docvars(dfm_tmp_RM)$ave_sentiment >= 0, ], k = 2)
textmodel_negative_RM_new <- seededlda::textmodel_lda(dfm_tmp_RM[ docvars(dfm_tmp_RM)$ave_sentiment < 0, ], k = 2)
cbind(terms(textmodel_positive_RM_new), terms(textmodel_negative_RM_new))
##       topic1       topic2       topic1      topic2      
##  [1,] "good"       "use"        "use"       "get"       
##  [2,] "work"       "feel"       "review"    "work"      
##  [3,] "remarkable" "like"       "noris"     "like"      
##  [4,] "marker"     "write"      "go"        "use"       
##  [5,] "eraser"     "tip"        "gt"        "remarkable"
##  [6,] "use"        "remarkable" "staedtler" "tip"       
##  [7,] "one"        "pencil"     "fakespot"  "try"       
##  [8,] "get"        "one"        "good"      "one"       
##  [9,] "staedtler"  "good"       "write"     "screen"    
## [10,] "buy"        "screen"     "product"   "button"
# Extract top words per topic
positive_terms_RM <- terms(textmodel_positive_RM_new, 10)
negative_terms_RM <- terms(textmodel_negative_RM_new, 10)

# Convert matrices into data frames
positive_df_RM <- data.frame(
  word = as.vector(positive_terms_RM),
  topic = rep(1:ncol(positive_terms_RM), each = nrow(positive_terms_RM)),
  sentiment = "Positive"
)

negative_df_RM <- data.frame(
  word = as.vector(negative_terms_RM),
  topic = rep(1:ncol(negative_terms_RM), each = nrow(negative_terms_RM)),
  sentiment = "Negative"
)

# Combine positive & negative topics
topics_combined_RM <- bind_rows(positive_df_RM, negative_df_RM)

# Count occurrences of words per topic
word_counts_RM <- topics_combined_RM %>%
  group_by(topic, word, sentiment) %>%
  summarise(count = n(), .groups = "drop")

# Plot: sorted within each facet
ggplot(word_counts_RM, aes(x = reorder_within(word, count, topic), y = count, fill = sentiment)) +
  geom_col(show.legend = TRUE) +
  facet_wrap(~ topic + sentiment, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +  # Needed to handle reordered_within
  labs(title = "Top Words per Topic and Sentiment",
       x = "Word", y = "Sentiment") +
  theme_minimal()

ggsave("top_words_per_topic_and_sentiment.png", width = 10, height = 42, limitsize = FALSE)
# Create a term frequency table
freq <- colSums(dfm_tmp_RM)
freq <- sort(freq, decreasing = TRUE)

# Basic word cloud
wordcloud(words = names(freq), freq = freq, min.freq = 5,
          max.words = 50, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"),
          scale = c(4, 0.8))


7. REVALIDATION


Now let's revalidate the topics. We define a compute_coherence function to compute the coherence of each topic. Coherence measures how semantically related the words in a topic are: for the top terms of a topic it sums log((D(w_i, w_j) + 1) / (D(w_j) + 1)) over ordered term pairs, where D(.) counts the documents in which the term(s) appear (essentially the UMass coherence measure). Higher (i.e. less negative, closer to zero) scores indicate that the words in a topic co-occur more often and are therefore more related to each other, which is generally desirable in topic modeling.

BIC Cristal

# BIC Cristal


compute_coherence_bic <- function(topic_terms, dfm_bic) {
  coherence_bic <- 0
  for (i in 2:length(topic_terms)) {
    for (j in 1:(i - 1)) {
      term_i <- topic_terms[i]
      term_j <- topic_terms[j]
      
      # Check if both terms exist in the dfm
      if (term_i %in% colnames(dfm_bic) && term_j %in% colnames(dfm_bic)) {
        # D(term_i and term_j)
        D_wi_wj <- sum(dfm_bic[, term_i] > 0 & dfm_bic[, term_j] > 0)
        # D(term_j)
        D_wj    <- sum(dfm_bic[, term_j] > 0)
        
        # Add smoothing +1 to avoid log(0)
        coherence_bic <- coherence_bic + log((D_wi_wj + 1) / (D_wj + 1))
      }
    }
  }
  return(coherence_bic)
}
# BIC Cristal
# Get topic terms
top_terms_pos_bic <- terms(textmodel_positive_bic_new, 10)
top_terms_neg_bic <- terms(textmodel_negative_bic_new, 10)

# Filter dfms
dfm_pos_bic <- dfm_tmp_bic[docvars(dfm_tmp_bic)$ave_sentiment >= 0, ]
dfm_neg_bic <- dfm_tmp_bic[docvars(dfm_tmp_bic)$ave_sentiment < 0, ]

# Compute coherence per topic
coherence_scores_pos_bic <- apply(top_terms_pos_bic, 2, function(term_set) {
  compute_coherence_bic(term_set, dfm_pos_bic)
})

coherence_scores_neg_bic <- apply(top_terms_neg_bic, 2, function(term_set) {
  compute_coherence_bic(term_set, dfm_neg_bic)
})

# Show mean coherence
cat("Mean coherence (Positive Model):", round(mean(coherence_scores_pos_bic), 3), "\n")
## Mean coherence (Positive Model): -81.51
cat("Mean coherence (Negative Model):", round(mean(coherence_scores_neg_bic), 3), "\n")
## Mean coherence (Negative Model): -73.942
# Optional: topic-wise scores
coherence_scores_pos_bic
##    topic1    topic2 
## -86.05098 -76.96950
coherence_scores_neg_bic
##    topic1    topic2 
## -80.50779 -67.37644

Wordcloud for BIC Cristal

# Extract top terms from each topic in positive model
top_terms_pos_bic <- terms(textmodel_positive_bic_new, 50)

# Extract top terms from each topic in negative model
top_terms_neg_bic <- terms(textmodel_negative_bic_new, 50)

# Create a list of all topic terms combined with labels
topic_terms_list_bic <- list()

for(i in 1:2) {
  topic_terms_list_bic[[paste0("pos_topic", i)]] <- top_terms_pos_bic[, i]
  topic_terms_list_bic[[paste0("neg_topic", i)]] <- top_terms_neg_bic[, i]
}

# Build a dfm with counts of these topic terms in the full filtered dfm
topic_term_freqs_bic <- lapply(topic_terms_list_bic, function(terms_vec) {
  colSums(dfm_tmp_bic[, intersect(terms_vec, featnames(dfm_tmp_bic))])
})

# Turn into a matrix with topics as rows, terms as columns
topic_term_matrix_bic <- do.call(rbind, lapply(topic_term_freqs_bic, function(freq) {
  # Make named vector to all terms with 0 for missing
  all_terms <- unique(unlist(topic_terms_list_bic))
  v <- numeric(length(all_terms))
  names(v) <- all_terms
  v[names(freq)] <- freq
  return(v)
}))

# Convert matrix to dfm
dfm_comparison_bic <- quanteda::as.dfm(topic_term_matrix_bic)

# Create a term frequency table
freq <- colSums(dfm_comparison_bic)
freq <- sort(freq, decreasing = TRUE)

# Basic word cloud
wordcloud(words = names(freq), freq = freq, min.freq = 5,
          max.words = 50, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"),
          scale = c(6, 0.8))

Remarkable 2 Pen

#Remarkable 2 Pen


compute_coherence_RM <- function(topic_terms, dfm) {
  coherence <- 0
  for (i in 2:length(topic_terms)) {
    for (j in 1:(i - 1)) {
      term_i <- topic_terms[i]
      term_j <- topic_terms[j]
      
      # Check if both terms exist in the dfm
      if (term_i %in% colnames(dfm) && term_j %in% colnames(dfm)) {
        # D(term_i and term_j)
        D_wi_wj <- sum(dfm[, term_i] > 0 & dfm[, term_j] > 0)
        # D(term_j)
        D_wj    <- sum(dfm[, term_j] > 0)
        
        # Add smoothing +1 to avoid log(0)
        coherence <- coherence + log((D_wi_wj + 1) / (D_wj + 1))
      }
    }
  }
  return(coherence)
}
#Remarkable 2 Pen
# Get topic terms
top_terms_pos_RM <- terms(textmodel_positive_RM_new, 10)
top_terms_neg_RM <- terms(textmodel_negative_RM_new, 10)

# Filter dfms
dfm_pos_RM <- dfm_tmp_RM[docvars(dfm_tmp_RM)$ave_sentiment >= 0, ]
dfm_neg_RM <- dfm_tmp_RM[docvars(dfm_tmp_RM)$ave_sentiment < 0, ]

# Compute coherence per topic
coherence_scores_pos_RM <- apply(top_terms_pos_RM, 2, function(term_set) {
  compute_coherence_RM(term_set, dfm_pos_RM)
})

coherence_scores_neg_RM <- apply(top_terms_neg_RM, 2, function(term_set) {
  compute_coherence_RM(term_set, dfm_neg_RM)
})

# Show mean coherence
cat("Mean coherence (Positive Model):", round(mean(coherence_scores_pos_RM), 3), "\n")
## Mean coherence (Positive Model): -68.797
cat("Mean coherence (Negative Model):", round(mean(coherence_scores_neg_RM), 3), "\n")
## Mean coherence (Negative Model): -70.774
# Optional: topic-wise scores
coherence_scores_pos_RM
##    topic1    topic2 
## -70.27165 -67.32299
coherence_scores_neg_RM
##    topic1    topic2 
## -67.83453 -73.71311

From the results above, the coherence scores for both the positive and the negative models are well below zero. For the BIC Cristal, the negative-sentiment topics are clearly more coherent (closer to zero) than the positive ones, while for the ReMarkable 2 pen the two models are roughly comparable, with the positive topics slightly more coherent.

The more coherent a model is, the more frequently and consistently the top words of its topics co-occur in the underlying documents.

For the BIC Cristal, this could mean that people writing negative reviews are more focused in their vocabulary (e.g., repeating the same criticisms), while positive reviewers vary more (e.g., complimenting different features).
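
To compare the four models at a glance, a small sketch collects the topic-wise coherence scores computed above for both products into a single data frame:

# Combine all topic-wise coherence scores into one comparison table
coherence_summary <- data.frame(
  product   = rep(c("BIC Cristal", "ReMarkable 2"), each = 4),
  model     = rep(rep(c("Positive", "Negative"), each = 2), times = 2),
  topic     = rep(c("topic1", "topic2"), times = 4),
  coherence = c(coherence_scores_pos_bic, coherence_scores_neg_bic,
                coherence_scores_pos_RM,  coherence_scores_neg_RM)
)
coherence_summary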

Wordcloud for Remarkable 2 Pen

# Extract top terms from each topic in positive model
top_terms_pos_RM <- terms(textmodel_positive_RM_new, 50)

# Extract top terms from each topic in negative model
top_terms_neg_RM <- terms(textmodel_negative_RM_new, 50)

# Create a list of all topic terms combined with labels
topic_terms_list_RM <- list()

for(i in 1:2) {
  topic_terms_list_RM[[paste0("pos_topic", i)]] <- top_terms_pos_RM[, i]
  topic_terms_list_RM[[paste0("neg_topic", i)]] <- top_terms_neg_RM[, i]
}

# Build a dfm with counts of these topic terms in the full filtered dfm
topic_term_freqs_RM <- lapply(topic_terms_list_RM, function(terms_vec) {
  colSums(dfm_tmp_RM[, intersect(terms_vec, featnames(dfm_tmp_RM))])
})

# Turn into a matrix with topics as rows, terms as columns
topic_term_matrix_RM <- do.call(rbind, lapply(topic_term_freqs_RM, function(freq) {
  # Make named vector to all terms with 0 for missing
  all_terms <- unique(unlist(topic_terms_list_RM))
  v <- numeric(length(all_terms))
  names(v) <- all_terms
  v[names(freq)] <- freq
  return(v)
}))

# Convert matrix to dfm
dfm_comparison_RM <- quanteda::as.dfm(topic_term_matrix_RM)

# Create a term frequency table
freq <- colSums(dfm_comparison_RM)
freq <- sort(freq, decreasing = TRUE)

# Basic word cloud
wordcloud(words = names(freq), freq = freq, min.freq = 5,
          max.words = 50, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"),
          scale = c(6, 0.8))

8. STATISTICAL TESTS

t-test

t_test_polarity <- t.test(bic_cristal_reviews$polarity, remarkable2_pen_reviews$polarity)
print(t_test_polarity)
## 
##  Welch Two Sample t-test
## 
## data:  bic_cristal_reviews$polarity and remarkable2_pen_reviews$polarity
## t = 2.2185, df = 1558.4, p-value = 0.02666
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.002369026 0.038523158
## sample estimates:
## mean of x mean of y 
## 0.1384701 0.1180240
t_test_subjectivity <- t.test(bic_cristal_reviews$subjectivity, remarkable2_pen_reviews$subjectivity)
print(t_test_subjectivity)
## 
##  Welch Two Sample t-test
## 
## data:  bic_cristal_reviews$subjectivity and remarkable2_pen_reviews$subjectivity
## t = -0.92687, df = 1650.2, p-value = 0.3541
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.007728293  0.002768154
## sample estimates:
## mean of x mean of y 
## 0.1111706 0.1136506


Polarity Comparison (Sentiment Strength)
- t = 2.2185, p-value = 0.0267 → The p-value is below 0.05, meaning the difference in polarity scores is statistically significant.
- Mean polarity: BIC Cristal (0.1385) vs. ReMarkable 2 Pen (0.1180) → BIC Cristal has the higher polarity, suggesting its reviews tend to be more positive than those of the ReMarkable 2 Pen.
- 95% confidence interval: (0.0024, 0.0385) → The true difference in means likely falls within this range; since it does not cross zero, it confirms the significance.
- Conclusion: BIC Cristal reviews show more positive sentiment than ReMarkable 2 Pen reviews.

Subjectivity Comparison (Opinion Strength)
- t = -0.9269, p-value = 0.3541 → The p-value is above 0.05, meaning the difference is NOT statistically significant.
- Mean subjectivity: BIC Cristal (0.1112) vs. ReMarkable 2 Pen (0.1137) → The values are very close, indicating both sets of reviews contain similar levels of subjective (personal) opinion.
- 95% confidence interval: (-0.0077, 0.0028) → Since it crosses zero, the observed difference is likely due to random variation.
- Conclusion: There is no strong evidence that BIC Cristal and ReMarkable 2 Pen reviews differ in subjectivity.
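
The t-test tells us the polarity gap is unlikely to be due to chance, but not how large it is. As a rough effect-size check, a minimal sketch (assuming the same polarity columns used in the t-test above) computes Cohen's d with a simple pooled standard deviation; given the reported t statistic and sample sizes it should come out around 0.1, i.e. a small effect.

# Cohen's d for the polarity difference (simple unweighted pooling of the two variances)
pooled_sd <- sqrt((var(bic_cristal_reviews$polarity, na.rm = TRUE) +
                   var(remarkable2_pen_reviews$polarity, na.rm = TRUE)) / 2)
cohens_d <- (mean(bic_cristal_reviews$polarity, na.rm = TRUE) -
             mean(remarkable2_pen_reviews$polarity, na.rm = TRUE)) / pooled_sd
cohens_d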

box plot

boxplot(bic_cristal_reviews$polarity, remarkable2_pen_reviews$polarity, 
        names = c("BIC Cristal", "Remarkable 2"), col = c("lightblue", "lightgreen"),
        main = "Polarity Comparison")

boxplot(bic_cristal_reviews$subjectivity, remarkable2_pen_reviews$subjectivity, 
        names = c("BIC Cristal", "Remarkable 2"), col = c("lightblue", "lightgreen"),
        main = "Subjectivity Comparison")

ANOVA test

library(dplyr)

# Combine datasets and add product label
combined <- bind_rows(
  bic_cristal_reviews %>% mutate(product = "BIC"),
  remarkable2_pen_reviews %>% mutate(product = "ReMarkable")
)

# Add sentiment category (positive, neutral, negative)
combined <- combined %>%
  mutate(sentiment_cat = case_when(
    polarity > 0.1  ~ "positive",
    polarity < -0.1 ~ "negative",
    TRUE            ~ "neutral"
  ))

# Now you can run ANOVA for polarity based on product + sentiment category (or just product)
anova_result <- aov(polarity ~ product, data = combined)
summary(anova_result)
##               Df Sum Sq Mean Sq F value Pr(>F)  
## product        1   0.18 0.17683    4.97 0.0259 *
## Residuals   1727  61.44 0.03558                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA result interpretation
- p-value: 0.0259 → Since p < 0.05, the result is statistically significant, meaning the polarity (sentiment) scores differ significantly between BIC and ReMarkable pen reviews.
- Product choice impacts sentiment polarity → The brand has a significant effect on review sentiment.
- The difference is not random → Consumers review the two products differently in terms of sentiment.
- However, the effect is small: the between-product sum of squares (0.18) is tiny compared with the residual sum of squares (61.44), so the difference, while significant, is far from large.
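
To quantify how small that effect is, a minimal sketch computes eta-squared (the between-product sum of squares divided by the total sum of squares) from the ANOVA fit above; with the values in the table this is roughly 0.18 / (0.18 + 61.44) ≈ 0.003.

# Eta-squared: share of polarity variance explained by the product factor
ss <- summary(anova_result)[[1]][["Sum Sq"]]
eta_squared <- ss[1] / sum(ss)
eta_squared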

Chi-Square test

# Example: Add sentiment category if not already present
bic_cristal_reviews$sentiment_cat <- ifelse(bic_cristal_reviews$polarity >= 0, "positive", "negative")
remarkable2_pen_reviews$sentiment_cat <- ifelse(remarkable2_pen_reviews$polarity >= 0, "positive", "negative")

# Combine brand label
bic_cristal_reviews$brand <- "BIC Cristal"
remarkable2_pen_reviews$brand <- "ReMarkable 2"

# Combine into one data frame
combined_reviews <- rbind(bic_cristal_reviews, remarkable2_pen_reviews)

# Create contingency table: brand vs sentiment category
table_sentiment <- table(combined_reviews$brand, combined_reviews$sentiment_cat)

# Run Chi-square test
chisq_test <- chisq.test(table_sentiment)
print(table_sentiment)
##               
##                negative positive
##   BIC Cristal       170      821
##   ReMarkable 2      166      572
print(chisq_test)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_sentiment
## X-squared = 7.3634, df = 1, p-value = 0.006656

Chi-square test result
- Chi-Square statistic (X-squared = 7.3634) → Measures the deviation between the observed counts and those expected if there were no relationship between product type and sentiment; a higher value suggests a stronger association.
- p-value (p = 0.0067) → Since p < 0.05, the result is statistically significant.
- This means the sentiment distribution differs significantly between BIC Cristal and ReMarkable 2; the difference is not due to random chance.

Interpretation & Conclusion
- Product choice impacts sentiment → Reviews for BIC Cristal and ReMarkable 2 show significantly different sentiment patterns.
- BIC Cristal has a higher proportion of positive reviews than ReMarkable 2.
- ReMarkable 2 has relatively fewer positive reviews, suggesting consumers rate it less favorably overall.
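
The association behind the chi-square result can also be read directly from the contingency table as row-wise shares; a minimal sketch using the table_sentiment object built above (from the counts shown, roughly 83% of BIC Cristal reviews are positive versus about 78% for the ReMarkable 2):

# Share of negative vs. positive reviews within each brand
round(prop.table(table_sentiment, margin = 1), 3)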

9. TEMPORAL DYNAMICS

At this point, as a final analysis, we explore how the polarity values and the number of reviews changed over time for both products; this kind of comparison allows for a more detailed picture. To perform it, we generate three plots: the first shows the monthly average polarity for each product, the second the monthly comment count for each product, and the third the monthly weighted polarity (average polarity × comment count) for each product.

# Convert to proper date format
safe_parse_date <- function(dates) {
  tryCatch({
    as.Date(dates)
  }, warning = function(w) {
    message("Warning: ", conditionMessage(w))
    suppressWarnings(as.Date(dates))
  }, error = function(e) {
    message("Error: ", conditionMessage(e))
    rep(NA, length(dates))
  })
}

# Safely convert dates
bic_reviews_cleaned$Date <- safe_parse_date(bic_reviews_cleaned$Date)
rm_reviews_cleaned$Date  <- safe_parse_date(rm_reviews_cleaned$Date)

# Combine and format for temporal analysis
all_reviews <- bind_rows(
  bic_reviews_cleaned %>% mutate(Product = "BIC Cristal"),
  rm_reviews_cleaned %>% mutate(Product = "ReMarkable 2 Pen")
) %>%
  mutate(Month = floor_date(Date, "month")) %>%
  filter(!is.na(Month))  # Remove any failed date parses
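
Before plotting, a quick sanity check (a small sketch on the all_reviews data frame just created) confirms that the dates parsed correctly and shows the period and volume covered by each product:

# Review counts and month ranges per product after date parsing
all_reviews %>%
  group_by(Product) %>%
  summarise(first_month = min(Month),
            last_month  = max(Month),
            n_reviews   = n(),
            .groups = "drop")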

Monthly average polarity over time.

# Monthly average polarity
avg_polarity <- all_reviews %>%
  group_by(Product, Month) %>%
  summarise(Avg_Polarity = mean(`Polarity score`, na.rm = TRUE), .groups = "drop")

ggplot(avg_polarity, aes(x = Month, y = Avg_Polarity, color = Product)) +
  geom_line(linewidth = 0.8, alpha = 0.9) +
  geom_point(size = 2) +
  scale_x_date(date_labels = "%Y", date_breaks = "1 year", minor_breaks = "1 month") +
  labs(
    title = "Monthly Average Polarity per Product",
    x = "Year",
    y = "Average Sentiment Polarity",
    color = "Product"
  ) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "bottom", 
        panel.grid.minor = element_line(color = "gray90"),
        panel.grid.major.x = element_line(color = "gray75"))

Monthly comment count over time.

# Monthly review count
review_count <- all_reviews %>%
  group_by(Product, Month) %>%
  summarise(Count = n(), .groups = "drop")

ggplot(review_count, aes(x = Month, y = Count, fill = Product)) +
  geom_col(position = "dodge", width = 25) +
  scale_x_date(date_labels = "%Y", date_breaks = "1 year", minor_breaks = "1 month") +
  labs(
    title = "Monthly Review Volume per Product",
    x = "Year",
    y = "Number of Reviews",
    fill = "Product"
  ) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "bottom", 
        panel.grid.minor = element_line(color = "gray90"),
        panel.grid.major.x = element_line(color = "gray75"))

Monthly weighted polarity (average polarity x comment count).

# Weighted polarity (avg polarity * count)
weighted_polarity <- left_join(avg_polarity, review_count, by = c("Product", "Month")) %>%
  mutate(Weighted_Polarity = Avg_Polarity * Count)

ggplot(weighted_polarity, aes(x = Month, y = Weighted_Polarity, color = Product)) +
  geom_line(linewidth = 0.8, alpha = 0.9) +
  geom_point(size = 2) +
  scale_x_date(date_labels = "%Y", date_breaks = "1 year", minor_breaks = "1 month") +
  labs(
    title = "Monthly Weighted Polarity (Polarity × Volume)",
    x = "Year",
    y = "Weighted Sentiment",
    color = "Product"
  ) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "bottom", 
        panel.grid.minor = element_line(color = "gray90"),
        panel.grid.major.x = element_line(color = "gray75"))

t-test

# Extract weighted polarity values
bic_weighted_polarity <- weighted_polarity %>% filter(Product == "BIC Cristal") %>% pull(Weighted_Polarity)
rm_weighted_polarity  <- weighted_polarity %>% filter(Product == "ReMarkable 2 Pen") %>% pull(Weighted_Polarity)

# Perform t-test comparing weighted polarity between products
t_test_result <- t.test(bic_weighted_polarity, rm_weighted_polarity, alternative = "two.sided")

# Display results
print(t_test_result)
## 
##  Welch Two Sample t-test
## 
## data:  bic_weighted_polarity and rm_weighted_polarity
## t = 2.8839, df = 90.391, p-value = 0.004909
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3191811 1.7324241
## sample estimates:
## mean of x mean of y 
##  2.325828  1.300026

t-test interpretation
- p-value = 0.0049 → Since p < 0.05, this result is statistically significant; the difference in weighted sentiment scores is unlikely to be due to random variation.
- 95% confidence interval: (0.3192, 1.7324) → The true difference in weighted sentiment polarity likely falls within this range.
- Mean weighted polarity: BIC Cristal 2.326 vs. ReMarkable 2 Pen 1.300.
- BIC Cristal reviews tend to carry stronger positive sentiment, meaning users express more favorable emotions toward it than toward the ReMarkable 2.

Interpretation
- Statistically significant difference: the weighted sentiment polarity of reviews differs meaningfully between BIC Cristal and ReMarkable 2.
- Higher positive sentiment for BIC Cristal: consumers review it more favorably than the ReMarkable 2.
- The confidence interval confirms the difference: the range does not cross zero, reinforcing that the polarity gap is real.

ANOVA test

# Ensure Product is a factor
weighted_polarity <- weighted_polarity %>% mutate(Product = factor(Product))

# Perform ANOVA
anova_result <- aov(Weighted_Polarity ~ Product, data = weighted_polarity)

# Display results
summary(anova_result)
##              Df Sum Sq Mean Sq F value Pr(>F)   
## Product       1   33.0   33.01   8.864 0.0035 **
## Residuals   124  461.8    3.72                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA test interpretation
- F-value (F = 8.864) → Indicates a clear difference between the product groups.
- p-value (p = 0.0035) → Since p < 0.05, this result is statistically significant.
- The difference in weighted sentiment is not due to random variation.
- The reviews for BIC Cristal and ReMarkable 2 Pen show meaningful differences in polarity trends.

Conclusion
- Product choice influences weighted polarity → Consumer sentiment varies significantly between BIC Cristal and ReMarkable 2 Pen.
- Since p < 0.01, the evidence for a difference is strong → The review sentiment distributions are meaningfully distinct.
- Effect size considerations: the product sum of squares (33.0) against the residual (461.8) corresponds to an eta-squared of roughly 0.07, so the effect is moderate rather than large.

Chi-Square test

# Define positive and negative sentiment categories
weighted_polarity <- weighted_polarity %>%
  mutate(Sentiment = ifelse(Weighted_Polarity > 0, "Positive", "Negative"))

# Create a contingency table
table_sentiment <- table(weighted_polarity$Product, weighted_polarity$Sentiment)

# Print table to verify
print(table_sentiment)
##                   
##                    Negative Positive
##   BIC Cristal             2       57
##   ReMarkable 2 Pen        4       63
# Perform Pearson's Chi-Square Test
chi_test_result <- chisq.test(table_sentiment)
## Warning in chisq.test(table_sentiment): Chi-squared approximation may be
## incorrect
# Display results
print(chi_test_result)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_sentiment
## X-squared = 0.067335, df = 1, p-value = 0.7953

Chi-Square test interpretation
- p-value (p = 0.7953) → Since p > 0.05, this result is not statistically significant.

Interpretation
- Weak association between product and monthly sentiment category → At the monthly level, the sentiment distribution is fairly similar for both products.
- The difference observed may be due to random variation rather than a meaningful effect of product type.

Insight
BIC is more positive, but the effect is small → detected by the t-test and ANOVA, but not strong enough to show up in the chi-square test.
The chi-square test struggles with the low “Negative” counts → it lacks the statistical power to confirm subtle sentiment shifts (hence the approximation warning above).
The results are consistent rather than contradictory → the difference is real, but a coarse positive/negative categorization of monthly scores does not fully capture it.
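
Because the "Chi-squared approximation may be incorrect" warning is triggered by the small Negative cell counts, an exact alternative is worth checking. A minimal sketch using base R's fisher.test() on the same monthly-level contingency table (not run here):

# Fisher's exact test avoids the chi-squared approximation on small counts
fisher.test(table_sentiment)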

Last 3 years analysis

Looking at the graphs (even though both series fluctuate considerably), we can observe that from 2019 (when the first ReMarkable pen comments appear) the sentiment scores for the BIC Cristal were almost always higher, whereas from roughly 2022 onwards the sentiment trends of the two products are more similar, with the start of 2025 again showing higher average and weighted sentiment scores for the BIC Cristal. For this reason, we now repeat the statistical tests of the previous section, but restricted to the last three years, to see whether there is still a significant difference in the weighted scores between the products or whether the sentiment regarding one or both pens has changed in recent years. We first subset the ‘weighted_polarity’ data set to keep only the comments posted from 2022 onwards.

# 1. Subset the existing 'weighted_polarity' data frame
weighted_recent <- weighted_polarity %>%
  filter(as.Date("2022-01-01") <= Month)

Then we perform the statistical tests.

# 2. T-TEST
bic_weighted_recent <- weighted_recent %>%
  filter(Product == "BIC Cristal") %>%
  pull(Weighted_Polarity)

rm_weighted_recent  <- weighted_recent %>%
  filter(Product == "ReMarkable 2 Pen") %>%
  pull(Weighted_Polarity)

t_test_recent <- t.test(bic_weighted_recent,
                        rm_weighted_recent,
                        alternative = "two.sided")

print("T-Test Results for Weighted Polarity (Jan 2022–present):")
## [1] "T-Test Results for Weighted Polarity (Jan 2022–present):"
print(t_test_recent)
## 
##  Welch Two Sample t-test
## 
## data:  bic_weighted_recent and rm_weighted_recent
## t = 2.4702, df = 65.592, p-value = 0.01611
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.2295923 2.1664827
## sample estimates:
## mean of x mean of y 
##  2.808691  1.610653
# 3. ANOVA test
anova_recent <- aov(Weighted_Polarity ~ Product,
                    data = weighted_recent)

print("ANOVA Results for Weighted Polarity by Product (Jan 2022–present):")
## [1] "ANOVA Results for Weighted Polarity by Product (Jan 2022–present):"
summary(anova_recent)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## Product      1   29.4  29.424   6.102 0.0156 *
## Residuals   80  385.8   4.822                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 4. CHI-SQUARE test
weighted_recent <- weighted_recent %>%
  mutate(Sentiment = ifelse(Weighted_Polarity > 0, "Positive", "Negative"))

table_sentiment_recent <- table(weighted_recent$Product,
                                weighted_recent$Sentiment)

print("Contingency Table (Product × Sentiment) for 2022–present:")
## [1] "Contingency Table (Product × Sentiment) for 2022–present:"
print(table_sentiment_recent)
##                   
##                    Negative Positive
##   BIC Cristal             2       39
##   ReMarkable 2 Pen        2       39
chi_sq_recent <- chisq.test(table_sentiment_recent)
## Warning in chisq.test(table_sentiment_recent): Chi-squared approximation may be
## incorrect
print("Chi-Square Test Results (Jan 2022–present):")
## [1] "Chi-Square Test Results (Jan 2022–present):"
print(chi_sq_recent)
## 
##  Pearson's Chi-squared test
## 
## data:  table_sentiment_recent
## X-squared = 0, df = 1, p-value = 1

Even considering only the last three years, the statistical tests give results similar to before, although they indicate that the difference between the weighted polarity scores has decreased (the p-values increased), while remaining significant for both the t-test and the ANOVA. The chi-square test, as before, is statistically non-significant.

Now, as a last analysis and partly out of curiosity, we run the same statistical tests on the period between the start of 2022 and the end of 2024, thereby excluding reviews published from the start of 2025 onwards.

# 1. Subset the existing 'weighted_polarity' data frame
weighted_recent1 <- weighted_polarity %>%
  filter(as.Date("2022-01-01") <= Month) %>%
  filter(Month <= as.Date("2024-12-01"))
# 2. T-TEST
bic_weighted_recent1 <- weighted_recent1 %>%
  filter(Product == "BIC Cristal") %>%
  pull(Weighted_Polarity)

rm_weighted_recent1  <- weighted_recent1 %>%
  filter(Product == "ReMarkable 2 Pen") %>%
  pull(Weighted_Polarity)

t_test_recent1 <- t.test(bic_weighted_recent1,
                        rm_weighted_recent1,
                        alternative = "two.sided")

print("T-Test Results for Weighted Polarity (Jan 2022–2025):")
## [1] "T-Test Results for Weighted Polarity (Jan 2022–2025):"
print(t_test_recent1)
## 
##  Welch Two Sample t-test
## 
## data:  bic_weighted_recent1 and rm_weighted_recent1
## t = 1.1746, df = 57.625, p-value = 0.245
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4162475  1.5980779
## sample estimates:
## mean of x mean of y 
##  2.388061  1.797146
# 3. ANOVA test
anova_recent1 <- aov(Weighted_Polarity ~ Product,
                    data = weighted_recent1)

print("ANOVA Results for Weighted Polarity by Product (Jan 2022–2025):")
## [1] "ANOVA Results for Weighted Polarity by Product (Jan 2022–2025):"
summary(anova_recent1)
##             Df Sum Sq Mean Sq F value Pr(>F)
## Product      1    6.2   6.197   1.396  0.241
## Residuals   69  306.2   4.438
# 4. CHI-SQUARE test
weighted_recent1 <- weighted_recent1 %>%
  mutate(Sentiment = ifelse(Weighted_Polarity > 0, "Positive", "Negative"))

table_sentiment_recent1 <- table(weighted_recent1$Product,
                                weighted_recent1$Sentiment)

print("Contingency Table (Product × Sentiment) for 2022–2025:")
## [1] "Contingency Table (Product × Sentiment) for 2022–2025:"
print(table_sentiment_recent1)
##                   
##                    Negative Positive
##   BIC Cristal             2       33
##   ReMarkable 2 Pen        1       35
chi_sq_recent1 <- chisq.test(table_sentiment_recent1)
## Warning in chisq.test(table_sentiment_recent1): Chi-squared approximation may
## be incorrect
print("Chi-Square Test Results (Jan 2022–2025):")
## [1] "Chi-Square Test Results (Jan 2022–2025):"
print(chi_sq_recent1)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_sentiment_recent1
## X-squared = 0.0006215, df = 1, p-value = 0.9801

As can be observed, all the statistical tests now result in statistically non-significant differences between the means. Thus, in the period from the beginning of 2022 until the end of 2024, the weighted polarity scores of the two products were similar.