1. INTRODUCTION

Text Mining is the process of extracting meaningful and relevant information from a large amount of unstructured or semi-structured text data. It involves techniques such as natural language processing, sentiment analysis, and topic modeling to uncover patterns, trends, and insights from text data sources such as social media posts, customer reviews, emails, and more. The goal of text mining is to transform raw text into structured data that can be analyzed and used to support decision-making and strategy formation. By leveraging the power of text mining, it is possible to gain a deeper understanding of businesses and industry trends, and use this information to drive growth and success.
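As a minimal illustration of this raw-text-to-structured-data step (a hypothetical sketch using the tidytext approach applied later in this project; the toy objects docs, doc and text are not part of the project's data), a few free-text strings can be tokenized into a structured word-count table:

library(dplyr)
library(tidytext)

# a toy corpus of three hypothetical "documents"
docs <- tibble(doc = 1:3,
               text = c("Great sound quality",
                        "Battery life is great",
                        "Poor battery life"))

# tokenize and count: unstructured text becomes a structured word-count table
docs %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)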

1.1 Purpose of the project

In this project, we aimed to harness the power of text mining techniques to uncover insights and patterns that would be difficult to identify through traditional data analysis methods.

The objectives of this project are:

  • to understand the importance of data pre-processing, cleaning, and normalization in text mining.
  • to analyze and extract relevant insights from the datasets through Sentiment Analysis and Topic Modelling.
  • to understand the challenges of working with unstructured text data and how to overcome them.
  • to learn how to visualize and communicate results in a meaningful and actionable way.

1.2 Main Assumptions

Elon Musk Tweets

Many people see Elon Musk as a controversial figure. Elon Musk is a South African-born entrepreneur and business magnate known for co-founding PayPal and leading companies such as Tesla Motors, SpaceX, Neuralink, and The Boring Company. He is one of the wealthiest individuals in the world and is known for his ambitious goals to revolutionize transportation (both terrestrial and interplanetary) and reduce the impact of humanity on the environment.

His tweets have been known to impact markets, cause confusion and controversy, and even result in legal consequences. Therefore, it’s important to approach his tweets with a critical eye and consider the context and potential consequences before taking any actions based on the information shared. Below are the main assumptions considered about Elon Musk’s tweets:

  • That his tweets always reflect his personal opinions.
  • That his tweets are always serious and accurate.
  • That his tweets are indicative of future plans.

Amazon Ear Phone Reviews

Amazon is a multinational technology company based in Seattle, Washington, founded in 1994 by Jeff Bezos. The company started as an online bookstore and has since grown into one of the largest e-commerce and cloud computing companies in the world. Amazon offers a range of products and services, including online shopping for consumer goods, cloud computing services for businesses, streaming of music and video, and more.

Earphones purchased through the Amazon e-commerce platform have received mixed reviews from customers. Some customers praise them for their convenience, sound quality and battery life, while others report connectivity issues and discomfort during extended use.

Below are the main assumptions considered about Amazon earphones reviews:

  • That all reviews made are trustworthy and unbiased.
  • That a high number of positive reviews automatically means the product is good.
  • That reviews accurately represent the experiences of all users and not a selected few.

2. DATA

It is important to note that data is the foundation of any text mining project and the quality and quantity of data used can greatly impact the accuracy and relevance of the insights obtained. With more data, text mining algorithms can better identify patterns and relationships, leading to more meaningful and actionable results. Moreover, having clean and well-structured data enables text mining algorithms to function efficiently and effectively, improving the overall results.

2.1 Description of the Datasets

To achieve the purpose of this project, two different datasets were utilized, and each of them is described below. These datasets were analyzed with the aim of performing Sentiment Analysis and Topic Modelling on their content, to gain insight into the sentiments or emotions expressed and to uncover the underlying structure they contain by identifying clusters of words that commonly co-occur.

The Elon Musk tweet dataset was obtained from Kaggle and contains tweets made by Elon Musk in 2022. The dataset consists of 4 columns and 3,060 rows of data.

The csv file contains records with the following variables, descriptions and data types.

Variable   Description                  Data Type
Tweets     Original text of the tweet   char
Retweets   Number of retweets           numeric
Likes      Number of likes              numeric
Date       Date of creation             char

The Amazon Earphone reviews dataset was obtained from Kaggle and consists of Amazon reviews and star ratings for the 10 latest Bluetooth earphone devices. The dataset consists of 4 columns and 14,337 rows of data.

The csv file contains records with the following variables, descriptions and data types.

Variable      Description               Data Type
ReviewTitle   Title of the review       char
ReviewBody    Body of the review        char
ReviewStar    Stars given by customer   numeric
Product       Product name              char

2.2 Preparation of data for modelling

This section covers the steps involved in preparing the data for modelling. We examine both datasets to determine whether any pre-processing, such as cleaning, splitting or transformation, is needed.

# loading necessary libraries
library(readr)        # reading csv files
library(stringr)      # string manipulation
library(tm)           # text mining: corpora, term-document matrices
library(igraph)       # network analysis
library(networkD3)    # interactive network visualisations
library(dplyr)        # data manipulation
library(wordcloud)    # word cloud plots
library(RColorBrewer) # colour palettes
library(syuzhet)      # sentiment extraction
library(ggplot2)      # plotting
library(tidyverse)    # tidy data tools (includes tibble, tidyr, forcats)
library(tidytext)     # tidy text mining
library(topicmodels)  # LDA topic modelling
library(SnowballC)    # word stemming
library(lubridate)    # date handling

# setting the working directory 
getwd()
setwd("C:/Users/Acer/Desktop/LECTURES/Year 2, Semester I/Text Mining and Social Media Mining/FOR PROJECT")

2.2.1 Elon Musk Dataset

# reading the data
emusk <- read.csv('rawdata.csv')

glimpse(emusk)
## Rows: 3,060
## Columns: 4
## $ Tweets   <chr> "@PeterSchiff 🤣 thanks", "@ZubyMusic Absolutely", "Dear Tw~
## $ Retweets <int> 209, 755, 55927, 802, 9366, 145520, 194, 117, 699, 126, 37951~
## $ Likes    <int> 7021, 26737, 356623, 19353, 195546, 1043592, 3611, 2848, 1018~
## $ Date     <chr> "2022-10-27 16:17:39", "2022-10-27 13:19:25", "2022-10-27 13:~
head(emusk, 5)
##                                             Tweets Retweets  Likes
## 1                         @PeterSchiff 🤣 thanks      209   7021
## 2                            @ZubyMusic Absolutely      755  26737
## 3 Dear Twitter Advertisers https://t.co/GMwHmInPAS    55927 356623
## 4                                   @BillyM2k 👻      802  19353
## 5   Meeting a lot of cool people at Twitter today!     9366 195546
##                  Date
## 1 2022-10-27 16:17:39
## 2 2022-10-27 13:19:25
## 3 2022-10-27 13:08:00
## 4 2022-10-27 02:32:48
## 5 2022-10-26 21:39:32

From observation of the data types for each column, it is evident that the Date column requires modification: it should be represented with a date data type.

# splitting the Date column of the csv file into separate parts (Year, Month, Day)
emusk$Date <- as.Date(emusk$Date)

emusk$Year <- year(ymd(emusk$Date))
emusk$Month <- month(ymd(emusk$Date)) 
emusk$Day <- day(ymd(emusk$Date))

# checking the distribution of the variable "Month" 
table(emusk$Month)
## 
##   1   2   3   4   5   6   7   8   9  10 
##  55 249 262 317 576 344 314 246 258 439
# replacing the month numbers with month abbreviations.
emusk$Month <- as.factor(emusk$Month)
emusk <- emusk %>%
  mutate(Month = recode(Month, "1" = 'Jan', "2" = 'Feb', "3" = 'Mar', "4" = "Apr", "5" = "May",
                        "6" = "Jun", "7" = "Jul", "8" = "Aug", "9" = "Sep", "10" = "Oct"))

head(emusk,5)
##                                             Tweets Retweets  Likes       Date
## 1                         @PeterSchiff 🤣 thanks      209   7021 2022-10-27
## 2                            @ZubyMusic Absolutely      755  26737 2022-10-27
## 3 Dear Twitter Advertisers https://t.co/GMwHmInPAS    55927 356623 2022-10-27
## 4                                   @BillyM2k 👻      802  19353 2022-10-27
## 5   Meeting a lot of cool people at Twitter today!     9366 195546 2022-10-26
##   Year Month Day
## 1 2022   Oct  27
## 2 2022   Oct  27
## 3 2022   Oct  27
## 4 2022   Oct  27
## 5 2022   Oct  26

2.2.2 Amazon Earphone Reviews Dataset

# read the dataset - Amazon reviews
df <- readr::read_csv("AllProductReviews.csv")
glimpse(df)
## Rows: 14,337
## Columns: 4
## $ ReviewTitle <chr> "Honest review of an edm music lover\n", "Unreliable earph~
## $ ReviewBody  <chr> "No doubt it has a great bass and to a great extent noise ~
## $ ReviewStar  <dbl> 3, 1, 4, 1, 5, 1, 4, 3, 5, 1, 4, 1, 5, 1, 1, 3, 4, 3, 1, 3~
## $ Product     <chr> "boAt Rockerz 255", "boAt Rockerz 255", "boAt Rockerz 255"~

The Amazon Earphone review dataset is already coded with appropriate data types; hence, no modifications were needed.

3. MODELLING

This section contains modelling using text mining algorithms to gain insights from the data. The choice of model depended on the nature of the data and the problem being solved, as well as on the objectives of this text mining project.

3.1 Sentiment Analysis on Elon Musk Tweets

# getting rid of all non-ASCII characters and other unwanted expressions
emusk$Tweets <- gsub("[^\x01-\x7F]", "", emusk$Tweets)
emusk$Tweets <- gsub("^@[A-Za-z0-9]*", "" , emusk$Tweets)
emusk$Tweets <- gsub("(www|http:|https:)+[^\\s]+[\\w]", "" , emusk$Tweets)

# Load the data as a corpus
TextDoc  <- Corpus(VectorSource(emusk$Tweets))
#Replacing "/", "@" and "|" with space
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
TextDoc <- tm_map(TextDoc, toSpace, "/")
TextDoc <- tm_map(TextDoc, toSpace, "@")
TextDoc <- tm_map(TextDoc, toSpace, "\\|")

# Convert the text to lower case
TextDoc <- tm_map(TextDoc, content_transformer(tolower))

# Remove numbers
TextDoc <- tm_map(TextDoc, removeNumbers)

# Remove english common stopwords
TextDoc <- tm_map(TextDoc, removeWords, stopwords("english"))

# specify my custom stopwords and meaningless words as a character vector
TextDoc <- tm_map(TextDoc, removeWords, c("https", "@", "http", "&", "amp", "ðY¤£",  
                                          "ðÿ", "t.co", "â", "¤£", "will", "[^\x01-\x7F]"))

# Remove punctuation
TextDoc <- tm_map(TextDoc, removePunctuation)

# Eliminate extra white spaces
TextDoc <- tm_map(TextDoc, stripWhitespace)

# Text stemming - which reduces words to their root form
TextDoc <- tm_map(TextDoc, stemDocument)

# Build a term-document matrix
TextDoc_dtm <- TermDocumentMatrix(TextDoc)
dtm_m <- as.matrix(TextDoc_dtm)

# Sort by decreasing value of frequency
dtm_v <- sort(rowSums(dtm_m),decreasing=TRUE)
dtm_d <- data.frame(word = names(dtm_v),freq=dtm_v)
# Display the top 8 most frequent words
head(dtm_d, 8)
##            word freq
## tesla     tesla  231
## twitter twitter  155
## spacex   spacex  129
## good       good  102
## time       time  100
## peopl     peopl   97
## just       just   94
## year       year   91
# barchart representation of the top 10 most frequent words
barplot(dtm_d[1:10,]$freq, las = 2, names.arg = dtm_d[1:10,]$word,
        col = hsv(1, 1, seq(0,1,length.out = 10)), 
                  main ="Top 10 most frequent words",
        ylab = "Word frequencies")

# generate word cloud
set.seed(1234)
wordcloud(words = dtm_d$word, freq = dtm_d$freq, min.freq = 5,
          max.words= 100, random.order = FALSE, rot.per = 0.40, 
          colors = brewer.pal(8, "Dark2"))

# Finding associations 
findAssocs(TextDoc_dtm, terms = c("tesla","twitter","spacex"), corlimit = 0.20)
## $tesla
## credit  owner 
##   0.23   0.20 
## 
## $twitter
## account    spam   sampl    fake 
##    0.28    0.24    0.22    0.21 
## 
## $spacex
##         djsnm      farryfaz       labpadr    marcushous     scalesnew 
##          0.25          0.25          0.25          0.25          0.25 
## spaceflashnew spaceintellig 
##          0.25          0.23

Among the top three words, associations exist for tesla, twitter and spacex. The output indicates that “credit” correlates with “tesla” at about 0.23 across tweets. Similarly, words like “account”, “spam” and “fake” correlate with “twitter” at roughly 0.2 or higher.
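As a conceptual check (not part of the original analysis): findAssocs() reports the correlation between two terms’ frequency vectors across documents, so a similar figure can be recovered directly from the matrix dtm_m built above:

# dtm_m has one row per term and one column per tweet; the association score
# is (conceptually) the correlation between two terms' frequency vectors
cor(dtm_m["tesla", ], dtm_m["credit", ])  # roughly 0.23, as reported above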

# regular sentiment score using get_sentiment() function 
syuzhet_vector <- get_sentiment(emusk$Tweets, method = "syuzhet")
summary(syuzhet_vector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -4.2500  0.0000  0.0000  0.2435  0.6000  5.2000
# bing
bing_vector <- get_sentiment(emusk$Tweets, method = "bing")
summary(bing_vector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -4.0000  0.0000  0.0000  0.1503  1.0000  5.0000
# afinn
afinn_vector <- get_sentiment(emusk$Tweets, method = "afinn")
summary(afinn_vector)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -13.0000   0.0000   0.0000   0.5078   2.0000  10.0000
# compare the first few rows of each vector using the sign function
rbind(
  sign(head(syuzhet_vector)),
  sign(head(bing_vector)),
  sign(head(afinn_vector))
)

Across the three methods of the get_sentiment() function, the overall average sentiment of Musk’s tweets is mildly positive: the medians equal 0 while the means are all positive.

# running nrc sentiment analysis, which returns a data frame with, for each tweet,
# word counts for eight emotions (anger, anticipation, disgust, fear, joy, sadness,
# surprise, trust) as well as counts of positive and negative words
d <- get_nrc_sentiment(emusk$Tweets)

# transposing the dataframe
td <- data.frame(t(d))

# summing the counts for each emotion across all 3,060 tweets
td_new <- data.frame(rowSums(td))

# transformation and cleaning
names(td_new)[1] <- "count"
td_new <- cbind("sentiment" = rownames(td_new), td_new)
rownames(td_new) <- NULL

# excluding the negative and positive counts
td_new2 <- td_new[1:8,]

# count of words associated with each sentiment
quickplot(sentiment, data=td_new2, weight=count, geom="bar", fill=sentiment, ylab="count")+
  ggtitle("Tweet Sentiments")

# count of words associated with each sentiment expressed as a percentage
barplot(
  sort(colSums(prop.table(d[, 1:8]))), 
  horiz = TRUE, 
  cex.names = 0.7, 
  las = 1, 
  main = "Emotions in Text", xlab="Percentage"
)

According to the NRC sentiment plot, most of Elon Musk’s tweets are highly subjective and are perceived as positive.

# to check the frequency of tweeting per month by Elon Musk 
# plot for tweets per month
ggplot(data = emusk, aes(x = Month)) +
  geom_bar(aes(fill = ..count..)) +
  xlab("Month") + ylab("Number of tweets") +
  theme_classic()

During the ten-month period under observation, Musk tweeted the most in May and relatively little in January.

3.2 Topic Modelling on Elon Musk Tweets

3.2.1 LDA on Elon Musk Tweets

Latent Dirichlet Allocation (LDA) is a generative probabilistic model for analyzing and identifying topics in a collection of documents. LDA assumes that each document in a collection is generated by a mixture of underlying topics and that each topic is characterized by a distribution of words. The model infers the topics and word distributions from the text data, and it can be used to represent the content of documents in terms of these topics.
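To make these two quantities concrete before running the model on the tweets, here is a minimal, purely illustrative sketch on a hypothetical two-document corpus (the toy objects toy_corpus, toy_dtm and toy_lda are not part of the project's data):

library(tm)
library(topicmodels)
library(tidytext)

# two tiny hypothetical documents with clearly different vocabularies
toy_corpus <- Corpus(VectorSource(c("rocket launch orbit rocket",
                                    "battery electric car battery")))
toy_dtm <- DocumentTermMatrix(toy_corpus)

# fit a 2-topic LDA model
toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 1))

tidy(toy_lda, matrix = "beta")   # per-topic word distributions
tidy(toy_lda, matrix = "gamma")  # per-document topic mixtures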

# performing Latent Dirichlet Allocation (LDA) on the tweets using the LDA function.
top_terms_by_topic_LDA <- function(input_text, plot = T, number_of_topics = 4) 
{    
  Corpus <- Corpus(VectorSource(input_text)) 
  DTM <- DocumentTermMatrix(Corpus) 
  
  unique_indexes <- unique(DTM$i) 
  DTM <- DTM[unique_indexes,] 
  
  lda <- LDA(DTM, k = number_of_topics, control = list(seed = 1234))
  topics <- tidy(lda, matrix = "beta")
  
  top_terms <- topics  %>% 
    group_by(topic) %>% 
    top_n(10, beta) %>% 
    ungroup() %>% 
    arrange(topic, -beta) 
  
  if(plot == T){
    top_terms %>% # take the top terms
      mutate(term = reorder(term, beta)) %>%  
      ggplot(aes(term, beta, fill = factor(topic))) + 
      geom_col(show.legend = FALSE) + 
      facet_wrap(~ topic, scales = "free") + 
      labs(x = NULL, y = "Beta") +  
      coord_flip() 
  }else{ 
    return(top_terms)
  }
}

# plot top ten terms in the tweets by topic
top_terms_by_topic_LDA(emusk$Tweets, number_of_topics = 2)

# LDA generated a lot of stop words, so the input text needs more cleaning

# creating a document term matrix to clean
tweetsCorpus <- Corpus(VectorSource(emusk$Tweets)) 
tweetsDTM <- DocumentTermMatrix(tweetsCorpus)

# converting the document term matrix to a tidytext corpus
tweetsDTM_tidy <- tidy(tweetsDTM)

# adding custom stop words 
custom_stop_words <- tibble(word = c("https", "@", "http", "&", "amp", "ðY¤£",  
                                     "ðÿ", "t.co", "â", "¤£", "will", "[^\x01-\x7F]"))

# remove stopwords and meaningless words
tweetsDTM_tidy_cleaned <- tweetsDTM_tidy %>% 
  anti_join(stop_words, by = c("term" = "word")) %>% 
  anti_join(custom_stop_words, by = c("term" = "word")) 

# reconstruct cleaned documents (so that each word shows up the correct number of times)
cleaned_documents <- tweetsDTM_tidy_cleaned %>%
  group_by(document) %>% 
  mutate(terms = toString(rep(term, count))) %>%
  select(document, terms) %>%
  unique()

# now let's look at the new most informative terms
top_terms_by_topic_LDA(cleaned_documents$terms, number_of_topics = 2)

# stem the words (e.g. convert each word to its stem, where applicable)
tweetsDTM_tidy_cleaned <- tweetsDTM_tidy_cleaned %>% 
  mutate(stem = wordStem(term))

# reconstruct our documents
cleaned_documents <- tweetsDTM_tidy_cleaned %>%
  group_by(document) %>% 
  mutate(terms = toString(rep(stem, count))) %>%
  select(document, terms) %>%
  unique()

# now let's look at the new most informative terms
top_terms_by_topic_LDA(cleaned_documents$terms, number_of_topics = 2)

3.2.2 TF-IDF on Elon Musk Tweets

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a numerical statistic used to reflect the importance of a term in a document within a set of documents. The idea behind TF-IDF is to determine the importance of a term by considering how often it appears in a document compared to the number of documents in which it appears. The term frequency (TF) measures how frequently a term appears in a document, and the inverse document frequency (IDF) measures how rare the term is across the entire set of documents. The product of these two values gives the TF-IDF weight of a term, which can then be used to rank the importance of terms within a document or across a set of documents.
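A small worked example (with hypothetical counts, independent of the project's datasets) makes the weighting concrete: a term that appears in every document receives idf = ln(2/2) = 0 and therefore a TF-IDF of 0, while a term confined to one document is weighted up.

library(dplyr)
library(tidytext)

# hypothetical word counts for two short documents
toy_counts <- tibble(doc  = c(1, 1, 2, 2),
                     word = c("sound", "great", "sound", "battery"),
                     n    = c(2, 1, 1, 3))

# tf = n / total words in the document;
# idf = ln(number of documents / documents containing the word);
# tf_idf = tf * idf
toy_counts %>% bind_tf_idf(word, doc, n)
# "sound" occurs in both documents, so its idf (and tf_idf) is 0;
# "battery" occurs only in document 2, so its idf is ln(2), about 0.69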

top_terms_by_topic_tfidf <- function(text_df, text_column, group_column, plot = T){
  group_column <- enquo(group_column)
  text_column <- enquo(text_column)
  
  # get the count of each word in each review
  words <- text_df %>%
    unnest_tokens(word, !!text_column) %>%
    count(!!group_column, word) %>% 
    ungroup()
  
  # get the number of words per text
  total_words <- words %>% 
    group_by(!!group_column) %>% 
    summarize(total = sum(n))
  
  # combine the two dataframes we just made
  words <- left_join(words, total_words)
  
  # get the tf_idf & order the words by degree of relevence
  tf_idf <- words %>%
    bind_tf_idf(word, !!group_column, n) %>%
    select(-total) %>%
    arrange(desc(tf_idf)) %>%
    mutate(word = factor(word, levels = rev(unique(word))))
  
  if(plot == T){
    # convert "group" into a quote of a name
    # (this is due to funkiness with calling ggplot2
    # in functions)
    group_name <- quo_name(group_column)
    
    # plot the 10 most informative terms per topic
    tf_idf %>% 
      group_by(!!group_column) %>% 
      top_n(10) %>% 
      ungroup %>%
      ggplot(aes(word, tf_idf, fill = as.factor(!!group_column))) +
      geom_col(show.legend = FALSE) +
      labs(x = NULL, y = "tf-idf") +
      facet_wrap(reformulate(group_name), scales = "free") +
      coord_flip()
  }else{
    # return the entire tf_idf dataframe
    return(tf_idf)
  }
}

# let's see what our most informative words were per month
top_terms_by_topic_tfidf(text_df = emusk, 
                         text_column = Tweets, 
                         group_column = Month, 
                         plot = T) 

3.3 Sentiment Analysis on Amazon Earphone Review Dataset

# Load the data as a corpus
TextDoc  <- Corpus(VectorSource(df$ReviewBody))

# Quick View of the data
head(df)
## # A tibble: 6 x 4
##   ReviewTitle                                            Revie~1 Revie~2 Product
##   <chr>                                                  <chr>     <dbl> <chr>  
## 1 "Honest review of an edm music lover\n"                "No do~       3 boAt R~
## 2 "Unreliable earphones with high cost\n"                "This ~       1 boAt R~
## 3 "Really good and durable.\n"                           "i bou~       4 boAt R~
## 4 "stopped working in just 14 days\n"                    "Its s~       1 boAt R~
## 5 "Just Awesome Wireless Headphone under 1000...\U0001f~ "Its A~       5 boAt R~
## 6 "Charging port not working\n"                          "After~       1 boAt R~
## # ... with abbreviated variable names 1: ReviewBody, 2: ReviewStar
#Replacing "/", "@" and "|" with space
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
TextDoc <- tm_map(TextDoc, toSpace, "/")
TextDoc <- tm_map(TextDoc, toSpace, "@")
TextDoc <- tm_map(TextDoc, toSpace, "\\|")

# Convert the text to lower case
TextDoc <- tm_map(TextDoc, content_transformer(tolower))

# Remove numbers
TextDoc <- tm_map(TextDoc, removeNumbers)

# Remove english common stopwords
TextDoc <- tm_map(TextDoc, removeWords, stopwords("english"))

# Remove your own stop word
# specify your custom stopwords as a character vector
TextDoc <- tm_map(TextDoc, removeWords, c("https"))

# Remove punctuation
TextDoc <- tm_map(TextDoc, removePunctuation)

# Eliminate extra white spaces
TextDoc <- tm_map(TextDoc, stripWhitespace)

# Text stemming - which reduces words to their root form
TextDoc <- tm_map(TextDoc, stemDocument)


# Build a term-document matrix
TextDoc_dtm <- TermDocumentMatrix(TextDoc)
dtm_m <- as.matrix(TextDoc_dtm)

# Sort by decreasing value of frequency
dtm_v <- sort(rowSums(dtm_m),decreasing=TRUE)
dtm_d <- data.frame(word = names(dtm_v),freq=dtm_v)

# Display the top 8 most frequent words
head(dtm_d, 8)
##            word freq
## good       good 6930
## sound     sound 6202
## qualiti qualiti 5945
## product product 4956
## bass       bass 2753
## earphon earphon 2615
## use         use 2510
## work       work 2024
# barchart representation of the top 8 most frequent words
barplot(dtm_d[1:8,]$freq, las = 2, names.arg = dtm_d[1:8,]$word,
        col = hsv(1, 1, seq(0,1,length.out = 10)), main ="Top 8 most frequent words",
        ylab = "Word frequencies")

#generate word cloud
set.seed(1234)
wordcloud(words = dtm_d$word, freq = dtm_d$freq, min.freq = 5,
          max.words=100, random.order=FALSE, rot.per=0.40, 
          colors=brewer.pal(8, "Dark2"))

# Find associations 
findAssocs(TextDoc_dtm, terms = c("good","sound","qualiti"), corlimit = 0.15)
## $good
##    bass batteri    also    life  overal 
##    0.20    0.19    0.18    0.18    0.15 
## 
## $sound
##    bass earphon   clear 
##    0.24    0.18    0.16 
## 
## $qualiti
## build built  wire  also 
##  0.31  0.18  0.16  0.15

With a correlation limit of 0.15, words like “battery”, “life” and “overall” were associated with “good”, while words like “bass”, “earphone” and “clear” were associated with “sound”.

# regular sentiment score using get_sentiment() function
syuzhet_vector <- get_sentiment(df$ReviewBody, method = "syuzhet")
summary(syuzhet_vector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -4.100   0.250   0.750   1.045   1.600  16.100
# using bing method
bing_vector <- get_sentiment(df$ReviewBody, method="bing")
summary(bing_vector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -7.000   0.000   1.000   1.124   2.000  14.000
# using afinn method
afinn_vector <- get_sentiment(df$ReviewBody, method="afinn")
summary(afinn_vector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -13.000   1.000   3.000   3.845   6.000  46.000
# compare the first few rows of each vector using the sign function
rbind(
  sign(head(syuzhet_vector)),
  sign(head(bing_vector)),
  sign(head(afinn_vector))
)

It is evident that most of the review sentiments were positive, because the medians under all three methods (syuzhet, bing and afinn) were positive values.

# running nrc sentiment analysis
# NOTE: this process takes a great deal of time and memory, so the operation is
# limited to the first 1,000 rows instead of the whole 14,337 observations.

d <- get_nrc_sentiment(df$ReviewBody[1:1000])

#transpose
td<-data.frame(t(d))

# summing the counts for each emotion across the 1,000 reviews
td_new <- data.frame(rowSums(td))

#Transformation and cleaning
names(td_new)[1] <- "count"
td_new <- cbind("sentiment" = rownames(td_new), td_new)
rownames(td_new) <- NULL
td_new2<-td_new[1:8,]

# counting of words associated with each sentiment
quickplot(sentiment, data=td_new2, weight=count, geom="bar", fill=sentiment, ylab="count")+ggtitle("Product Sentiments")

# in the plot, trust and anticipation are the top sentiments (almost the same score) 
# followed by joy sentiment and they are all positive sentiments. 

# expressed as a percentage
barplot(
  sort(colSums(prop.table(d[, 1:8]))), 
  horiz = TRUE, 
  cex.names = 0.7, 
  las = 1, 
  main = "Emotions in Text", xlab="Percentage"
)

Most of the reviews of the Amazon earphone products were positive, which signifies that the majority of the people who used the products were pleased with their performance.

3.4 Topic Modelling on Amazon Earphone Reviews

3.4.1 LDA on Amazon Earphone Reviews

# reusing the top_terms_by_topic_LDA() function defined in section 3.2.1

# plot top ten terms in the reviews by topic
top_terms_by_topic_LDA(df$ReviewBody, number_of_topics = 2)

The LDA output is again dominated by stop words, so the input text needs further cleaning.

# creating a document term matrix to clean
reviewsCorpus <- Corpus(VectorSource(df$ReviewBody)) 
reviewsDTM <- DocumentTermMatrix(reviewsCorpus)

# converting the document term matrix to a tidytext corpus
reviewsDTM_tidy <- tidy(reviewsDTM)

# remove stopwords and meaningless words
reviewsDTM_tidy_cleaned <- reviewsDTM_tidy %>% 
  anti_join(stop_words, by = c("term" = "word"))

# reconstruct cleaned documents (so that each word shows up the correct number of times)
cleaned_documents <- reviewsDTM_tidy_cleaned %>%
  group_by(document) %>% 
  mutate(terms = toString(rep(term, count))) %>%
  select(document, terms) %>%
  unique()

# now let's look at the new most informative terms
top_terms_by_topic_LDA(cleaned_documents$terms, number_of_topics = 2)

# stem the words (e.g. convert each word to its stem, where applicable)
reviewsDTM_tidy_cleaned <- reviewsDTM_tidy_cleaned %>% 
  mutate(stem = wordStem(term))

# reconstruct our documents
cleaned_documents <- reviewsDTM_tidy_cleaned %>%
  group_by(document) %>% 
  mutate(terms = toString(rep(stem, count))) %>%
  select(document, terms) %>%
  unique()

# now let's look at the new most informative terms
top_terms_by_topic_LDA(cleaned_documents$terms, number_of_topics = 2)

3.4.2 A Different Term Frequency Approach on Amazon Earphone Reviews

# using unnest_tokens()
tidy_review <- df %>%
  unnest_tokens(word, ReviewBody)

# counting words
tidy_review %>%
  count(word) %>%
  arrange(desc(n))

# using unnest_tokens() with stopwords
tidy_review2 <- df %>%
  unnest_tokens(word, ReviewBody) %>%
  anti_join(stop_words)

# the number of words is drastically reduced
tidy_review2

# counting words again
tidy_review2 %>%
  count(word) %>%
  arrange(desc(n))

# Visualizing text

# starting with tidy text
tidy_review <- df %>%
  mutate(id = row_number()) %>%  
  unnest_tokens(word, ReviewBody) %>%
  anti_join(stop_words)

tidy_review

# visualizing counts with geom_col() and filtering word count 
# filter() before visualizing
word_counts <- tidy_review %>%
  count(word) %>%
  filter(n>700) %>%
  arrange(desc(n))

# word count
ggplot(word_counts, aes(x=word, y=n)) + 
  geom_col() +
  coord_flip() +
  ggtitle("Review Word Counts")
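The console output below refers to a custom stop word (“2”) and an ordered factor column word2 that the code above does not create, so an intermediate step appears to be missing from the listing. A plausible reconstruction consistent with that output (the names custom_stop_words2 and stop_words2 are assumptions) is:

# reconstructed step (assumed): append the custom stop word "2" to the lexicon
custom_stop_words2 <- tribble(~word, ~lexicon,
                              "2",   "CUSTOM")
custom_stop_words2

stop_words2 <- stop_words %>%
  bind_rows(custom_stop_words2)

# recount, keep frequent words, and order them by count for plotting
word_counts <- tidy_review %>%
  anti_join(stop_words2) %>%
  count(word) %>%
  filter(n > 700) %>%
  mutate(word2 = fct_reorder(word, n))
word_counts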

## # A tibble: 1 x 2
##   word  lexicon
##   <chr> <chr>  
## 1 2     CUSTOM
## Joining, by = "word"
## # A tibble: 20 x 3
##    word             n word2       
##    <chr>        <int> <fct>       
##  1 awesome       1230 awesome     
##  2 bass          2831 bass        
##  3 battery       1937 battery     
##  4 bluetooth      813 bluetooth   
##  5 buy           1045 buy         
##  6 cancellation   780 cancellation
##  7 ear           1182 ear         
##  8 earphone      1064 earphone    
##  9 earphones     1547 earphones   
## 10 life           932 life        
## 11 money          781 money       
## 12 music          959 music       
## 13 nice          1053 nice        
## 14 noise         1086 noise       
## 15 price         1588 price       
## 16 product       4795 product     
## 17 quality       5970 quality     
## 18 range          747 range       
## 19 sound         6015 sound       
## 20 worth          731 worth
# ordered column
ggplot(word_counts, aes(x=word2, y=n)) + 
  geom_col() +
  coord_flip() +
  ggtitle("Review Word Counts")

# plot faceted by Product (word_counts dropped Product, so recount per product):
word_counts_product <- tidy_review %>%
  count(Product, word) %>%
  group_by(Product) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word2 = fct_reorder(word, n))

ggplot(word_counts_product, aes(x=word2, y=n, fill=Product)) + 
  geom_col(show.legend=FALSE) +
  facet_wrap(~Product, scales="free_y") +
  coord_flip() +
  ggtitle("Product Counts")

# Review Star Count (likewise recounted per star rating)
word_counts_star <- tidy_review %>%
  count(ReviewStar, word) %>%
  group_by(ReviewStar) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word2 = fct_reorder(word, n))

ggplot(word_counts_star, aes(x=word2, y=n, fill=factor(ReviewStar))) + 
  geom_col(show.legend=FALSE) +
  facet_wrap(~ReviewStar, scales="free_y") +
  coord_flip() +
  ggtitle("Review Star Counts")

4. EVALUATION

This text mining project aimed to analyze a large corpus of customer reviews of Amazon earphones and of Elon Musk’s personal tweets, and to determine the customers’ overall sentiment towards the products and the sentiment of the tweets, respectively. The project used sentiment analysis techniques to classify each review and tweet as positive, negative, or neutral.

The datasets were pre-processed by removing stop words, stemming, and converting the text into numerical features using a term frequency-inverse document frequency (TF-IDF) representation, after which Sentiment Analysis and Topic Modelling algorithms were applied.

The sentiment analysis of Elon Musk’s tweets showed that the sentiment attached to the majority of his tweets was positive. Elon Musk was quite active in 2022; most of his tweets were made in May, while he posted relatively few tweets in January.

Topic Modelling performed on the same subject suggests that Musk’s tweets were time-specific. This is evident from the TF-IDF chart, which showed considerable variability in the tweets over time. For example, in March his top tweet was about Joe Biden, related to reactions to the Russia-Ukraine war in progress around that period. It also shows that Musk’s tweets are dynamic and cover a wide range of engagements, from religion to politics and even show business.

For the Amazon earphone reviews, the majority of the sentiments were positive; the most frequent topics concerned sound and quality, while a few people discussed the products’ worth.

It is also important to note that, according to the topic modelling performed per ReviewStar (stars given by the customer, on a scale from 1, worst, to 5, best), the bar charts for stars 1 and 2 contained evidence of negative reviews, with customers expressing negative sentiments through words like worst, poor and stopped. This contrasts with the bar charts for stars 4 and 5, which showed that customers praised the products they bought with words like awesome, nice and quality, signifying positive sentiments.

5. SUMMARY

The text mining project, using sentiment analysis and topic modelling, succeeded in determining the overall sentiment of the customers towards the products and the sentiments in Elon Musk’s tweets. In line with the purpose of the project, the analysis extracted relevant insights from the datasets after the necessary data preprocessing had been done.

For the Amazon earphone reviews, and considering the assumptions made for this project, the results showed that the majority of the customer reviews were positive and that the sentiments expressed were related to the type of earphone bought.

Regarding Elon Musk’s tweets, the results showed that most of his tweets were positive and that their content depended on the current events he was interested in, which were not limited to the technology field he is best known for.

The insights from the analysis can help earphone companies improve their products and customer service. However, there are limitations stemming from the subjectivity of customer reviews that need to be addressed in future work.