This text and sentiment analysis report covers the following topics:
Analysing customer sentiment in the reviews of a McDonald's fast food restaurant using the Bing (created by Bing Liu et al.) and NRC (created by Dr. Saif Mohammad) sentiment dictionaries.
Carrying out an LDA topic modelling analysis of a sample GameStop product review dataset, using the collapsed Gibbs sampling technique.
Analysing the YouTube comments under Mrwhosetheboss' iPhone 16 & iPhone 16 Pro unboxing and review video to gauge commenter sentiment towards the newest iPhones.
Setting Up
library(tidyverse)
library(tidytext)
library(wordcloud)
library(textdata)
mcd_r <- read_csv("mcdonalds_reviews.csv")
Question 1
Question 1, part (a)
Q. Create a barchart visualising the top 20 most frequently occurring words, making sure to ignore stop words.
Finding the 20 most frequently occurring words in the dataset.
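The word count described here can be sketched as follows. This is a minimal illustration rather than the report's actual code: `toy_reviews` is a hypothetical stand-in for `mcd_r`, and the column name `review` is an assumption.

```r
library(tidyverse)
library(tidytext)

# Hypothetical toy reviews standing in for mcd_r (assumed to have a `review` column)
toy_reviews <- tibble(
  id = 1:3,
  review = c(
    "The food was cold and the service was slow",
    "Fast service and hot food",
    "Slow drive thru but the food was hot"
  )
)

top_words <- toy_reviews %>%
  unnest_tokens(word, review) %>%          # one lowercase word per row
  anti_join(stop_words, by = "word") %>%   # drop common stop words
  count(word, sort = TRUE) %>%             # frequency of each remaining word
  slice_max(n, n = 20, with_ties = FALSE)  # keep at most the 20 most frequent

print(top_words)
```

The resulting tibble has the `word` and `n` columns that a `geom_col()` bar chart needs.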
As can be seen, the count of negative sentiment words in the mcd_r tibble is approximately 1,500 higher than the count of positive sentiment words. This means that the reviews contain considerably more negative sentiment towards this particular McDonald's branch, which suggests that most reviews are negative.
Question 1, part (b (i)
Q. Analyse the sentiment of the reviews using the “Bing” sentiment dictionary. Find the most common words associated with “positive” and “negative” sentiment.
I shortened the list to the 10 most frequently occurring words for each of the positive and negative sentiments.
I then plotted these top words for each sentiment.
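One way to build the two top-10 lists is sketched below. The names `positive` and `negative` match the data frames used in the plots that follow, but how they were actually created is an assumption, and `toy_words` is hypothetical data.

```r
library(tidyverse)
library(tidytext)

# Hypothetical tokens standing in for the non-stop words of the reviews
toy_words <- tibble(
  word = c("fast", "fast", "clean", "slow", "slow", "slow", "rude", "burger")
)

bing_counts <- toy_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep only words Bing labels
  count(word, sentiment, sort = TRUE)

# Keep the 10 most frequent words within each sentiment
top_bing <- bing_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10, with_ties = FALSE) %>%
  ungroup()

positive <- filter(top_bing, sentiment == "positive")
negative <- filter(top_bing, sentiment == "negative")
```

Words that do not appear in the Bing dictionary (here "burger") are dropped by the inner join, so they never reach either list.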
Positive Sentiments Graph:
ggplot(data = positive) +
  geom_col(mapping = aes(x = n, y = reorder(word, n))) +
  labs(y = NULL) +
  ggtitle("Top 10 Positive Sentiment Words") +
  xlab("Count of Word") +
  theme_minimal() +
  theme(
    panel.grid.major = element_line(color = "white"),
    axis.title.x = element_text(colour = "black", face = "bold"),
    axis.title.y = element_text(colour = "black", face = "bold"),
    axis.text = element_text(colour = "#666666"),
    axis.ticks = element_line(colour = "#666666"),
    axis.line = element_line(colour = "#666666"),
    title = element_text(colour = "black", face = "bold"),
    plot.title = element_text(face = "bold")
  ) +
  theme(plot.title = element_text(hjust = 0.5))
The most frequently occurring positive sentiment words are fast, pretty, nice, hot and clean. This suggests that customers are happy with the speed of the branch, that the building is pretty, nice and clean, and that the food is hot.
Negative Sentiments Graph:
ggplot(data = negative) +
  geom_col(mapping = aes(x = n, y = reorder(word, n))) +
  labs(y = NULL) +
  ggtitle("Top 10 Negative Sentiment Words") +
  xlab("Count of Word") +
  theme_minimal() +
  theme(
    panel.grid.major = element_line(color = "white"),
    axis.title.x = element_text(colour = "black", face = "bold"),
    axis.title.y = element_text(colour = "black", face = "bold"),
    axis.text = element_text(colour = "#666666"),
    axis.ticks = element_line(colour = "#666666"),
    axis.line = element_line(colour = "#666666"),
    title = element_text(colour = "black", face = "bold"),
    plot.title = element_text(face = "bold")
  ) +
  theme(plot.title = element_text(hjust = 0.5))
The top negative sentiment words are worst, bad, wrong, slow and rude. This implies that customers are unhappy with the quality of the food, that orders have been delivered with the wrong items, that the service is slow and that the staff tend to be rude.
Taking both graphs into consideration, it appears that customers are happy with the state of the building, describing it as pretty, nice and clean. However, the service at this particular McDonald's branch needs improvement.
Question 1, part (b (ii)
Q. Create two barcharts showing how positive and negative sentiments have changed over time. Use blocks of size 150. Comment on your findings.
Here I separate the tibble into blocks and count the number of positive and negative sentiments of each block.
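A minimal sketch of the blocking step is given below. It assumes the reviews are in chronological order and that each sentiment word carries the id of its review. `toy_sentiments`, `review_id` and `toy_blocks` are hypothetical names, and the block size is reduced from 150 to 2 so the toy output stays readable.

```r
library(tidyverse)

# Hypothetical word-level sentiment tibble: one row per sentiment word, with the
# id of the review it came from (reviews assumed to be in chronological order)
toy_sentiments <- tibble(
  review_id = c(1, 1, 2, 3, 3, 4, 5, 6),
  sentiment = c("negative", "positive", "negative", "negative",
                "positive", "positive", "negative", "positive")
)

block_size <- 2  # the report uses 150; 2 keeps the toy example readable

toy_blocks <- toy_sentiments %>%
  # Integer division: reviews 1-2 fall in block 1, reviews 3-4 in block 2, and so on
  mutate(block = (review_id - 1) %/% block_size + 1) %>%
  count(block, sentiment)

print(toy_blocks)
```

The output has one row per block and sentiment, which is the shape that `facet_wrap(~ sentiment)` expects.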
This information is then plotted to show how negative and positive sentiment change over time. Breaking the data down into blocks lets us follow how sentiment progresses, with the oldest reviews in block 1 and the newest reviews in block 10.
ggplot(mcd_r_blocks) +
  geom_col(mapping = aes(x = block, y = n)) +
  facet_wrap(~ sentiment, nrow = 1) +
  ylab("Count of Sentiments") +
  ggtitle("Sentiments Over Time") +
  xlab("Block") +
  theme_minimal() +
  theme(
    panel.grid.major = element_line(color = "white"),
    axis.title.x = element_text(colour = "black", face = "bold"),
    axis.title.y = element_text(colour = "black", face = "bold"),
    axis.text = element_text(colour = "#666666"),
    axis.ticks = element_line(colour = "#666666"),
    axis.line = element_line(colour = "#666666"),
    title = element_text(colour = "black", face = "bold"),
    plot.title = element_text(face = "bold")
  ) +
  theme(plot.title = element_text(hjust = 0.5))
The most obvious feature is that the negative sentiment graph has higher counts than the positive sentiment graph, meaning that the analysed reviews express negative sentiment more often than positive sentiment. This could point to recurring issues at the branch that should be looked into.
Negative sentiment fluctuates over the period and is highest in block 1, where the earliest reviews are found. Despite the fluctuation, negative sentiment falls over time: it rises again in blocks 6, 7, 8 and 9, but never reaches the height of block 1. This suggests that the branch may have worked on resolving certain issues.
Positive sentiment, on the other hand, is more even over time, but spikes in blocks 1, 2 and 5. Taken together with the negative sentiment count, this suggests that some customers are receiving good customer service whilst others are receiving bad customer service, which could imply inconsistency in service or food quality.
Question 1, part (c)
Q. Analyse the sentiment of the reviews using the “NRC” sentiment dictionary, which contains 10 different sentiments.
First, I will join the NRC sentiments to the list of non-stop words from the dataset.
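A join against NRC is many-to-many by nature: one word can carry several sentiments, and the same word can occur in many reviews. The sketch below uses a hypothetical three-row lexicon (`toy_lexicon`) rather than the real NRC data, and declares the relationship explicitly, which requires dplyr 1.1.0 or later.

```r
library(tidyverse)

# Hypothetical three-row lexicon in the same word/sentiment shape as NRC
toy_lexicon <- tibble(
  word = c("bad", "bad", "food"),
  sentiment = c("anger", "negative", "joy")
)

# Hypothetical review tokens; "bad" appears twice
toy_tokens <- tibble(word = c("bad", "food", "bad"))

# Each "bad" token matches two lexicon rows, and each lexicon row for "bad"
# matches two tokens, so the relationship is many-to-many by design
toy_joined <- inner_join(
  toy_tokens, toy_lexicon,
  by = "word",
  relationship = "many-to-many"
)

print(toy_joined)
```

Because every token is repeated once per sentiment it carries, a word's count appears once under each of its sentiments in any later tally.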
Warning in inner_join(mcd_r_non_stop_words, nrc_sentiments): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 3 of `x` matches multiple rows in `y`.
ℹ Row 5143 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
Question 1, part (c (i)
Q. Find the most common words associated with each sentiment.
nrc_most_common <- mcd_r_nrc_sentiments %>%
  inner_join(get_sentiments("nrc")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() %>%
  # top_n(10) keeps the 10 highest counts overall (ties included), not 10 per sentiment
  top_n(10)
knitr::kable(nrc_most_common, "pipe",
             col.names = c("Word", "Sentiment", "Count"),
             align = c("l", "c", "c"),
             caption = "Most common word associated with each sentiment")
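Most common word associated with each sentiment

| Word     | Sentiment    | Count |
|:---------|:------------:|:-----:|
| food     | joy          | 866   |
| food     | positive     | 866   |
| food     | trust        | 866   |
| time     | anticipation | 522   |
| customer | positive     | 186   |
| bad      | anger        | 185   |
| bad      | disgust      | 185   |
| bad      | fear         | 185   |
| bad      | negative     | 185   |
| bad      | sadness      | 185   |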
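The top positive sentiment words are food and customer, with food also counted under joy and trust. Time falls under anticipation, likely because customers are waiting for their orders. Bad is the most common word across the negative sentiments (anger, disgust, fear, negative and sadness).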
Question 1, part (c (ii)
Q. How frequently do each of the 10 sentiments appear in the reviews dataset?
count_nrc_sentiments <- count(mcd_r_nrc_sentiments, sentiment, sort = TRUE)
ggplot(count_nrc_sentiments) +
  geom_col(mapping = aes(x = sentiment, y = n)) +
  ylab("Count of Sentiments") +
  ggtitle("Frequency of All NRC Sentiments") +
  xlab("Sentiment") +
  theme_minimal() +
  theme(
    panel.grid.major = element_line(color = "white"),
    axis.title.x = element_text(colour = "black", face = "bold"),
    axis.title.y = element_text(colour = "black", face = "bold"),
    axis.text = element_text(colour = "#666666"),
    axis.ticks = element_line(colour = "#666666"),
    axis.line = element_line(colour = "#666666"),
    title = element_text(colour = "black", face = "bold"),
    plot.title = element_text(face = "bold")
  ) +
  theme(plot.title = element_text(hjust = 0.5))
The most frequent NRC sentiment is positive, followed by negative, and the least frequent is surprise. Trust and anticipation are also notable, as they are the third and fourth most frequent sentiments. The high trust count may suggest that the branch is seen as trustworthy. Anticipation is high because the word time falls under anticipation. If many reviews mention time, this could indicate that the service at this branch is slow and that there are long waiting times.
Question 1, part (d)
Q. Find the top 20 most frequently occurring bigrams, making sure to remove bigrams that contain a stop word.
‘Worst McDonalds’ occurs twice in this bigram list, indicating that customers are highly dissatisfied with the branch. Customer service and ice cream also appear frequently. Time frames such as 10 minutes, 15 minutes and 20 minutes appear often as well, and may refer to waiting times.
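The bigram extraction can be sketched as below, using hypothetical toy reviews and an assumed `review` column. A bigram is dropped if either of its words is a stop word.

```r
library(tidyverse)
library(tidytext)

# Hypothetical toy reviews (the `review` column name is an assumption)
toy_reviews <- tibble(
  id = 1:3,
  review = c(
    "Worst customer service ever",
    "The ice cream machine is broken",
    "Great customer service"
  )
)

toy_bigrams <- toy_reviews %>%
  unnest_tokens(bigram, review, token = "ngrams", n = 2) %>%   # overlapping word pairs
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%  # split to test each word
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%                      # drop pairs with a stop word
  count(word1, word2, sort = TRUE) %>%
  slice_max(n, n = 20, with_ties = FALSE)

print(toy_bigrams)
```

The same pattern extends to trigrams by setting `n = 3` and separating into three word columns.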
Question 1, part (e)
Q. Find the top 20 most frequently occurring trigrams, making sure to remove trigrams that contain a stop word.
Ice cream machine, worst customer service and 24 hour drive are the most frequently occurring trigrams. These indicate that there may be issues with the ice cream machine and with the quality of the customer service. 24 hour drive is probably part of the phrase 24 hour drive thru.
Question 1, part (f (i)
Q. Find all reviews that contain the word “waiting” and export these reviews to a .csv file (make sure your results are case-insensitive). Read 10 of the reviews (or all the reviews if less than 10 are returned) and summarise the context in which reviewers are referring to “waiting”.
From the above analysis, waiting is a frequently mentioned word. After filtering for all reviews that mention waiting and reading the first 10, the context of waiting in the reviews is as follows:
This McDonald's branch has long waiting times both inside the building and in the drive thru. The reviews also mention slow staff, with some reviewers calling the staff lazy. Lack of management, unprofessionalism, lost orders and cold food are also mentioned.
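A case-insensitive filter and export can be sketched as follows. The toy data, the `review` column name and the output file name are assumptions for illustration.

```r
library(tidyverse)

# Hypothetical toy reviews (the `review` column name is an assumption)
toy_reviews <- tibble(
  id = 1:4,
  review = c(
    "Waiting 20 minutes in the drive thru",
    "Food was hot and fresh",
    "Still WAITING for my order",
    "We were kept waiting inside"
  )
)

# regex(..., ignore_case = TRUE) matches "waiting", "Waiting" and "WAITING" alike
waiting_reviews <- toy_reviews %>%
  filter(str_detect(review, regex("waiting", ignore_case = TRUE)))

# Hypothetical output file, written to a temporary directory
out_file <- file.path(tempdir(), "waiting_reviews.csv")
write_csv(waiting_reviews, out_file)
```

The same pattern works for multi-word phrases such as "shamrock shake" or "ice cream machine" by changing the search string.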
Question 1, part (f (ii)
Q. Find all reviews that contain the word “shamrock shake” and export these reviews to a .csv file (make sure your results are case-insensitive). Read 10 (or all the reviews if less than 10 are returned) of the reviews and summarise the context in which reviewers are referring to “shamrock shake”.
After filtering for all reviews that mention the seasonal Shamrock Shake and reading the first 10, the context of shamrock shakes in the reviews is as follows:
There are mixed reviews of the shamrock shake. Some customers describe it as decent and enjoyable, whilst others comment on the ‘obviously fake flavouring’ and the weak flavour, and mention that the drink is badly mixed and tends to be chalky.
Question 1, part (f (iii)
Q. Find all reviews that contain the word “ice cream machine” and export these reviews to a .csv file (make sure your results are case-insensitive). Read 10 (or all the reviews if less than 10 are returned) of the reviews and summarise the context in which reviewers are referring to “ice cream machine”.
From the above analysis, the ice cream machine is a frequently mentioned topic. After filtering for all reviews that mention the ice cream machine and reading the first 10, the context of the ice cream machine in the reviews is as follows:
The reviews mention that the ice cream machine is often either broken down or locked, especially at night. Reviewers also say that the staff are rude about the ice cream machine and cannot be bothered to fill it, which is why it is often unavailable. A wrong order involving the ice cream machine is also mentioned.
Question 1, part (g)
Q. Create two coloured word clouds, one showing the most common non-stopwords that are classified as “positive” and another showing the most common non-stopwords that are classified as “negative”. Include only those words that appear at least 50 times in the “positive” word cloud and at least 50 times in the “negative” word cloud.
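A minimal sketch of the two word clouds is shown below, under stated assumptions: `toy_words` is hypothetical data, the threshold is lowered from 50 to 2 so that the toy data produces output, and the colour palettes are an arbitrary choice.

```r
library(tidyverse)
library(tidytext)
library(wordcloud)

# Hypothetical tokens standing in for the non-stop words of the reviews
toy_words <- tibble(
  word = c(rep("fast", 5), rep("clean", 3), rep("nice", 1),
           rep("slow", 6), rep("rude", 4), rep("bad", 2))
)

min_count <- 2  # the question asks for 50; lowered here for the toy data

bing_counts <- toy_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment) %>%
  filter(n >= min_count)  # keep only words that meet the frequency threshold

positive_cloud <- filter(bing_counts, sentiment == "positive")
negative_cloud <- filter(bing_counts, sentiment == "negative")

# One coloured cloud per sentiment; brewer.pal() comes from RColorBrewer,
# which is attached together with wordcloud
wordcloud(words = positive_cloud$word, freq = positive_cloud$n,
          min.freq = min_count, colors = brewer.pal(8, "Greens"))
wordcloud(words = negative_cloud$word, freq = negative_cloud$n,
          min.freq = min_count, colors = brewer.pal(8, "Reds"))
```

Filtering before plotting, rather than relying on `min.freq` alone, makes the threshold visible in the data frames themselves.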
Question 2
In this section, a GameStop reviews dataset is analysed using LDA topic modelling with the collapsed Gibbs sampling method.
Setting up for analysis:
library(topicmodels)
library(topicdoc)
library(reshape2)
Question 2, part (a)
Q. Correctly prepare the data by removing stop words, counting the number of times each word appears in each review and converting to a Document Term Matrix.
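gmstp <- read_csv("gamestop_product_reviews.csv")
# Extend the standard stop word list with contractions that lose their apostrophe
my_stop_words <- bind_rows(stop_words, tibble(word = c("im", "ive", "id", "theyve", "theyre", "dont")))
# Tokenise the reviews into words and drop stop words
gmstp_no_stop_words <- gmstp %>%
  unnest_tokens(word, review, token = "words") %>%
  anti_join(my_stop_words)
# Count how often each word appears in each review
gmstp_word_counts <- count(gmstp_no_stop_words, id, word, sort = TRUE)
# Convert the counts to a Document Term Matrix
gmstp_dtm <- cast_dtm(gmstp_word_counts, document = id, term = word, value = n)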
Question 2, part (b)
Q. When using the LDA() function:
Use the Collapsed Gibbs Sampling method.
Set the random seed at 1234.
You can choose whatever value of k that you think leads to interesting and actionable topics (note that not all topics need to be interesting and actionable), although you should use a value of k ≥ 10. Finding a suitable value of k may require some trial and error, which means you may end up creating several different LDA models with different values of k, but only the final LDA model should be included in this report.
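# Fit the LDA model with collapsed Gibbs sampling, k = 12 topics and a fixed seed
gmstp_lda <- LDA(gmstp_dtm, method = "Gibbs", k = 12, control = list(seed = 1234))
# Extract the per-topic word probabilities (beta)
gmstp_lda_beta <- tidy(gmstp_lda)
# Keep the 12 highest-beta terms in each topic
gmstp_lda_top_terms <- gmstp_lda_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 12, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, -beta)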
Question 2, part (c (i)
Q. Assess the quality of the topics produced by the LDA algorithm using the following methods:
i. Visually: create barcharts showing the top 10 terms for each topic and write a summary of what you think each topic is focussing on.
I found out that 12 topics gave the most actionable results.
gmstp_lda_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  group_by(topic, term) %>%
  arrange(desc(beta)) %>%
  ungroup() %>%
  ggplot(aes(beta, term, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  labs(title = "Top 12 terms in each LDA topic", x = expression(beta), y = NULL) +
  facet_wrap(~ topic, ncol = 6, scales = "free")
Topic 1 includes words such as TV, picture, headset and sound, suggesting a topic about the quality of headphones when connected to a TV to play a video game on the big screen. The words quality, comfortable and features suggest a good product experience when using the headphones.
Topics 2, 4, 7, 11 and 12 are all topics which refer to different games such as Zelda, Fallout 3 and 4 and Pokemon Black and White, whereas other topics mention the gaming experience without mentioning any particular game titles.
Words such as kids, son, nice, cute, loves and easy indicate that topics 3 and 10 focus on purchases for kids which were positively received.
Topic 5 focuses on Xbox and Nintendo Switch gaming. The words awesome, fan, recommend, love and design indicate a good experience with these products.
Topic 6 focuses on a Samsung computer monitor. Words such as amazing, gaming, perfect, quality, movies and streaming indicate that the screen works well for multiple purposes, such as playing games or watching videos. This indicates that the monitor's quality is high.
Topic 8 includes words such as batteries, energizer, lasting and recommend. This indicates a positive experience with Energizer alkaline batteries for use in a flashlight.
Topic 9 includes many numbers and negative sentiment words such as worth, bad, honestly and money, indicating a negative experience with an expensive product. However, the product itself is not mentioned, which makes this topic not particularly actionable.
Question 2, part (c (ii)
Q. Assess the quality of the topics produced by the LDA algorithm using the following methods:
ii. Numerically: calculate the topic size, mean token length, topic coherence and topic exclusivity. Comment on your findings. For example, which topics appear to have the highest quality? Lowest quality?
Topic 8 has the coherence score closest to 0, making it the most coherent topic.
Topic 3 has the score farthest from 0, making it the least coherent topic.
Topic 12 has the biggest topic size, whilst topic 8 has the smallest.
The topic exclusivity scores are all very similar, so the topics are not strongly distinct from one another. There is a lot of overlap in the words found in each topic: game, games and gameplay appear in multiple topics, which makes the topics harder to define and interpret.
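The four measures can be computed with the topicdoc package loaded earlier. The sketch below runs on a hypothetical toy corpus with k = 2, not on the GameStop data. It calls each measure with its default settings, and the choice to collect the results in one tibble is purely illustrative.

```r
library(tidyverse)
library(tidytext)
library(topicmodels)
library(topicdoc)

# Hypothetical word counts per document, standing in for gmstp_word_counts
toy_counts <- tribble(
  ~id, ~word,         ~n,
  1,   "headset",     3,
  1,   "sound",       2,
  1,   "quality",     2,
  1,   "comfortable", 1,
  2,   "headset",     2,
  2,   "sound",       3,
  2,   "picture",     1,
  2,   "quality",     1,
  3,   "zelda",       3,
  3,   "story",       2,
  3,   "gameplay",    2,
  3,   "fun",         1,
  4,   "zelda",       2,
  4,   "gameplay",    3,
  4,   "story",       1,
  4,   "fun",         2,
  5,   "batteries",   3,
  5,   "lasting",     2,
  5,   "recommend",   1,
  5,   "quality",     1,
  6,   "pokemon",     3,
  6,   "story",       1,
  6,   "fun",         1,
  6,   "gameplay",    1
)

toy_dtm <- cast_dtm(toy_counts, document = id, term = word, value = n)
toy_lda <- LDA(toy_dtm, method = "Gibbs", k = 2, control = list(seed = 1234))

# One row per topic, one column per quality measure
toy_quality <- tibble(
  topic = 1:2,
  size = topic_size(toy_lda),
  mean_token_length = mean_token_length(toy_lda),
  coherence = topic_coherence(toy_lda, toy_dtm),
  exclusivity = topic_exclusivity(toy_lda)
)

print(toy_quality)
```

topicdoc also provides `topic_diagnostics()`, which returns these measures together with several others in a single data frame.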
Question 2, part (d)
Q. Based on your analysis, suggest some actions that could be taken by GameStop to improve their business.
Dig deeper to find out which product(s) are being discussed in topic 9, and remove the product from the catalog if it is not meeting customer expectations.
Question 3
For this section, I will analyse the YouTube comments under Mrwhosetheboss' review and unboxing of the newest iPhone 16 and 16 Pro.
Full Video Title: iPhone 16 / 16 Pro Unboxing - Testing every new feature!
ggplot(data = iphone_counts) +
  geom_col(mapping = aes(x = n, y = reorder(word, n))) +
  labs(y = NULL) +
  ggtitle("Top 20 Most Frequently Occurring Words") +
  xlab("Count") +
  theme_minimal() +
  theme(
    panel.grid.major = element_line(color = "white"),
    axis.title.x = element_text(colour = "black", face = "bold"),
    axis.text = element_text(colour = "#666666"),
    axis.ticks = element_line(colour = "#666666"),
    axis.line = element_line(colour = "#666666"),
    title = element_text(colour = "black", face = "bold"),
    plot.title = element_text(face = "bold")
  ) +
  theme(plot.title = element_text(hjust = 0.5))
The most frequently occurring words are iPhone, Apple, 16, pro and phone. Other frequently mentioned words are video, camera and battery, which suggests that these features of the newest iPhone are widely discussed.
Bing Dictionary Analysis
I will test commenter sentiments using the Bing dictionary to get a better idea of whether the comments are positive or negative surrounding the newest iPhone launch.
As can be seen, commenter sentiment is split approximately 50/50, with positive sentiment slightly higher. This means that the launch was received positively overall, but only by a slim margin. The negative sentiment may stem from apprehension about the newest features and whether they are worth Apple’s premium price point.
I will look into this further to see which of the most frequently used words fall under positive and negative sentiment.
The words boring, crazy, expensive, dumb and bad are the most frequently used negative sentiment words.
NRC Dictionary Analysis
As the NRC dictionary contains 10 different sentiments, a sentiment analysis with it gives a more precise picture of commenters' sentiments surrounding this launch.
Warning in inner_join(iphone_reviews_non_stop_words, nrc_sentiments): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 19 of `x` matches multiple rows in `y`.
ℹ Row 13775 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
Positive sentiments are the most frequent, with trust, anticipation and joy having the highest counts among them.
This indicates that Apple customers are excited about the launch, and the sense of anticipation perhaps relates to upgrading their current phone to the iPhone 16 or iPhone 16 Pro. The trust sentiment relates to customers' trust in the Apple brand and its products, which is a big positive for Apple.
The most frequent negative emotion towards the newest iPhone launch is anger, which indicates that commenters with negative opinions of this launch are angry about something. I will filter for the anger sentiment to find the most frequently used words associated with it.
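iphone_reviews_neg_sentiments <- filter(iphone_reviews_nrc_sentiments, sentiment == "anger")
# Note: this counts the full NRC tibble, so words from every sentiment are included;
# pass iphone_reviews_neg_sentiments instead to restrict the table to anger words
top_angry_sentiments <- count(iphone_reviews_nrc_sentiments, word, sort = TRUE) %>%
  top_n(15)
knitr::kable(top_angry_sentiments, "pipe",
             col.names = c("Word", "Count"),
             align = c("l", "c"),
             caption = "Count of Most Frequent Words Relating to Anger")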
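Count of Most Frequent Words Relating to Anger

| Word         | Count |
|:-------------|:-----:|
| money        | 570   |
| battery      | 326   |
| intelligence | 324   |
| finally      | 276   |
| love         | 242   |
| wait         | 232   |
| bad          | 220   |
| crazy        | 212   |
| hate         | 175   |
| hope         | 165   |
| feature      | 162   |
| time         | 162   |
| honest       | 161   |
| pretty       | 148   |
| excited      | 135   |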
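The most frequent words in the table are money and battery, which suggests that commenters are concerned about the price point of the newest phones as well as the longevity of the phone's battery. The table also includes words such as love, hope and excited, so it reflects words from all NRC sentiments rather than anger alone, and the link to anger specifically should be read with caution.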
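To restrict a word count to a single sentiment, the filtered tibble has to be the one passed to `count()`. The sketch below shows the pattern on hypothetical data. The word and sentiment pairs in `toy_nrc` are invented for illustration and are not claims about the real NRC dictionary.

```r
library(tidyverse)

# Hypothetical NRC-style rows: one row per token and sentiment pair
toy_nrc <- tibble(
  word = c("money", "money", "money", "battery",
           "love", "love", "hate", "money"),
  sentiment = c("anger", "anger", "joy", "anger",
                "joy", "positive", "anger", "trust")
)

# Filter first, then count within the filtered tibble
toy_anger <- filter(toy_nrc, sentiment == "anger")
top_anger <- count(toy_anger, word, sort = TRUE) %>%
  slice_max(n, n = 15, with_ties = FALSE)

# For comparison: counting the unfiltered tibble mixes in every sentiment
all_counts <- count(toy_nrc, word, sort = TRUE)

print(top_anger)
print(all_counts)
```

In the toy data, "love" appears in `all_counts` but not in `top_anger`, and the count for "money" drops from 4 to 2 once the other sentiments are excluded.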