Text and Sentiment Analysis

Author

Joanna Zaremba, C00258740

In this text and sentiment analysis, the following topics will be covered:

Setting Up

library(tidyverse)
library(tidytext)
library(wordcloud)
library(textdata)
mcd_r <- read_csv("mcdonalds_reviews.csv")

Question 1

Question 1, part (a)

Q. Create a barchart visualising the top 20 most frequently occurring words, making sure to ignore stop words.

Finding the 20 most frequently occurring words in the dataset.

mcd_r_counts <- mcd_r %>% 
  unnest_tokens(word, review, token = "words") %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  top_n(20)

I created a graph displaying the above 20 most frequently occurring words for easier analysis.

ggplot(mcd_r_counts) +
  geom_col(mapping = aes(x = n, y = reorder(word, n))) +
  labs(y = NULL) +
  ggtitle("Top 20 Most Frequently Occurring Words") +
  xlab("Count of Word") +
  theme_minimal() +
  theme(panel.grid.major = element_line(color = "white"),
        axis.title.x = element_text(colour = "black", face = "bold"),
        axis.title.y = element_text(colour = "black", face = "bold"),
        axis.text = element_text(colour = "#666666"),
        axis.ticks = element_line(colour = "#666666"),
        axis.line = element_line(colour = "#666666"),
        title = element_text(colour = "black", face = "bold"),
        plot.title = element_text(face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5)) 

As can be seen, the words food, McDonalds, drive, time and service occur most frequently, so these are the subjects that reviewers mention most often.

Question 1, part (b)

Q. Analyse the sentiment of the reviews using the “Bing” sentiment dictionary.

First I will join the Bing sentiments to the list of non-stop words from the dataset.

bing_sentiments <- get_sentiments("bing")

mcd_r_non_stop_words<- mcd_r %>% 
  unnest_tokens(word, review, token = "words") %>%
  anti_join(stop_words)

mcd_r_bing_sentiments <- inner_join(mcd_r_non_stop_words, bing_sentiments)

I placed the data in a table for easier analysis.

count_mcd_r_sentiments <- mcd_r_bing_sentiments %>% 
  anti_join(stop_words) %>%
  count(sentiment, sort = TRUE)

knitr::kable(count_mcd_r_sentiments, "pipe", col.names = c("Sentiment", "Count"), align = c("l", "c"), caption = "Counts of McDonald's reviews sentiments")
Counts of McDonald’s reviews sentiments

| Sentiment | Count |
|:----------|:-----:|
| negative  | 4595  |
| positive  | 3009  |

As can be seen, the count of negative sentiment words in the mcd_r tibble is approximately 1,600 higher than the count of positive sentiment words (4,595 against 3,009, so about 60% of the matched words are negative). This suggests that the reviews lean negative towards this particular McDonald's branch. However, these are word-level counts, so they do not by themselves show that most individual reviews are negative.
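Whether most individual reviews lean negative can be checked by scoring each review separately. The following is a minimal sketch, assuming a tibble shaped like mcd_r_bing_sentiments (one row per matched word, with id and sentiment columns); the toy data and the names toy_sentiments, review_scores and review_lean are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for mcd_r_bing_sentiments: one row per sentiment word per review
toy_sentiments <- tibble(
  id        = c(1, 1, 1, 2, 2, 3, 3, 3),
  word      = c("slow", "rude", "clean", "fast", "nice", "bad", "wrong", "worst"),
  sentiment = c("negative", "negative", "positive", "positive", "positive",
                "negative", "negative", "negative")
)

# Count positive and negative words per review, then take the difference
review_scores <- toy_sentiments %>%
  count(id, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative,
         leaning = case_when(net > 0 ~ "positive",
                             net < 0 ~ "negative",
                             TRUE    ~ "neutral"))

# Number of reviews leaning each way
review_lean <- count(review_scores, leaning)
```

In this toy data two of the three reviews lean negative. Applied to the real tibble, review_lean would show how many reviews (rather than words) fall on each side.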

Question 1, part (b)(i)

Q. Analyse the sentiment of the reviews using the “Bing” sentiment dictionary. Find the most common words associated with “positive” and “negative” sentiment.

I chose to shorten the list to 10 words to see the most frequently occurring words associated with the positive and negative sentiments.

positive <- mcd_r_bing_sentiments %>%
  filter(sentiment == "positive") %>%
  count(word, sort = TRUE) %>%
  top_n(10)

negative <- mcd_r_bing_sentiments %>%
  filter(sentiment == "negative") %>%
  count(word, sort = TRUE) %>%
  top_n(10)

I placed the most frequently occurring words associated with each sentiment into graphs.

Positive Sentiments Graph:

ggplot(data = positive) +
  geom_col(mapping = aes(x = n, y = reorder(word, n))) +
  labs(y = NULL) +
  ggtitle("Top 10 Positive Sentiment Words") +
  xlab("Count of Word") +
  theme_minimal() +
  theme(panel.grid.major = element_line(color = "white"),
        axis.title.x = element_text(colour = "black", face = "bold"),
        axis.title.y = element_text(colour = "black", face = "bold"),
        axis.text = element_text(colour = "#666666"),
        axis.ticks = element_line(colour = "#666666"),
        axis.line = element_line(colour = "#666666"),
        title = element_text(colour = "black", face = "bold"),
        plot.title = element_text(face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5)) 

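The most frequently occurring positive sentiment words are fast, pretty, nice, hot and clean. This suggests that customers are happy with the speed of the branch, that the building is pretty, nice and clean, and that the food is hot. Note, however, that "pretty" may often be used as an intensifier (as in "pretty good") rather than as a description of the building.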

Negative Sentiments Graph:

ggplot(data = negative) +
  geom_col(mapping = aes(x = n, y = reorder(word, n))) +
  labs(y = NULL) +
  ggtitle("Top 10 Negative Sentiment Words") +
  xlab("Count of Word") +
  theme_minimal() +
  theme(panel.grid.major = element_line(color = "white"),
        axis.title.x = element_text(colour = "black", face = "bold"),
        axis.title.y = element_text(colour = "black", face = "bold"),
        axis.text = element_text(colour = "#666666"),
        axis.ticks = element_line(colour = "#666666"),
        axis.line = element_line(colour = "#666666"),
        title = element_text(colour = "black", face = "bold"),
        plot.title = element_text(face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5)) 

The top negative sentiment words are worst, bad, wrong, slow and rude. This implies that customers are unhappy with the quality of the food, that the orders have been delivered with the wrong items, and that the service is slow and the staff tend to be rude.

Taking both of these graphs into consideration, it appears that customers are happy with the state of the building, describing it as pretty, nice and clean; however, the service at this particular McDonald's branch needs improvement.

Question 1, part (b)(ii)

Q. Create two barcharts showing how positive and negative sentiments have changed over time. Use blocks of size 150. Comment on your findings.

Here I separate the tibble into blocks and count the number of positive and negative sentiments of each block.

mcd_r_bing_sentiments <- mcd_r %>% 
  unnest_tokens(word, review, token = "words") %>%
  anti_join(stop_words) %>%
  inner_join(bing_sentiments)

mcd_r_block_sentiments <- mutate(mcd_r_bing_sentiments, block = id %/% 150)

mcd_r_blocks <- mcd_r_block_sentiments %>%
  group_by(block) %>%
  count(sentiment)

This information is then placed into a graph to analyse how negative and positive sentiments change over time. By breaking the data down into blocks, we can analyse how sentiment progresses over time (the oldest reviews being in block one, and the newest reviews being in block 10).

ggplot(mcd_r_blocks) +
  geom_col(mapping = aes(x = block, y = n)) +
  facet_wrap(~ sentiment, nrow = 1) +
  ylab("Count of Sentiments") + 
  ggtitle("Sentiments Over Time") +
  xlab("Block") +
  theme_minimal() +
  theme(panel.grid.major = element_line(color = "white"),
        axis.title.x = element_text(colour = "black", face = "bold"),
        axis.title.y = element_text(colour = "black", face = "bold"),
        axis.text = element_text(colour = "#666666"),
        axis.ticks = element_line(colour = "#666666"),
        axis.line = element_line(colour = "#666666"),
        title = element_text(colour = "black", face = "bold"),
        plot.title = element_text(face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5)) 

The most obvious feature is that the negative sentiment panel has higher counts than the positive sentiment panel, meaning that negative sentiment words appear more often than positive ones in the analysed reviews. This could point to recurring issues at the branch that should be looked into.

Negative sentiment fluctuates over the period and is highest in block 1, where the earliest reviews are found. Despite the fluctuation, negative sentiment falls over time: although it rises again in blocks 6, 7, 8 and 9, it never returns to the level of block 1. This suggests that the branch has worked on resolving certain issues.

Positive sentiment, on the other hand, is more even over time, but spikes in blocks 1, 2 and 5. Taken together with the negative sentiment counts, this suggests that some customers receive good service whilst others receive bad service, which could imply inconsistency in service or food quality.
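The block numbers used above come from integer division of the review id by 150. The short sketch below shows how `%/%` maps ids to blocks, together with an alternative that numbers blocks from 1 and gives every block exactly 150 ids. It assumes ids run 1, 2, 3, ... in chronological order, which is not confirmed by the data shown here; the example ids are made up.

```r
# Review ids are assumed to run 1, 2, 3, ... in chronological order
id <- c(1, 149, 150, 151, 300, 1500)

# Integer division as used above: under that assumption the first block is 0
# and holds only ids 1 to 149
block_zero_based <- id %/% 150

# Alternative: shift the ids so every block holds exactly 150 reviews,
# numbered from 1
block_one_based <- (id - 1) %/% 150 + 1
```

With the second version, ids 1 to 150 fall in block 1 and ids 151 to 300 in block 2, which may be easier to describe in the commentary.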

Question 1, part (c)

Q. Analyse the sentiment of the reviews using the “NRC” sentiment dictionary, which contains 10 different sentiments.

First I will join the NRC sentiments to the list of non-stop words from the dataset.

nrc_sentiments <- get_sentiments("nrc")
nrc_sentiments
# A tibble: 13,872 × 2
   word        sentiment
   <chr>       <chr>    
 1 abacus      trust    
 2 abandon     fear     
 3 abandon     negative 
 4 abandon     sadness  
 5 abandoned   anger    
 6 abandoned   fear     
 7 abandoned   negative 
 8 abandoned   sadness  
 9 abandonment anger    
10 abandonment fear     
# ℹ 13,862 more rows
mcd_r_non_stop_words<- mcd_r %>% 
  unnest_tokens(word, review, token = "words") %>%
  anti_join(stop_words)

# A word can carry several NRC sentiments, so a many-to-many join is expected here
mcd_r_nrc_sentiments <- inner_join(mcd_r_non_stop_words, nrc_sentiments,
                                   relationship = "many-to-many")

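The join against the NRC lexicon is many-to-many because a single word can be tagged with several sentiments and can also occur in several reviews. The sketch below illustrates this on toy data; toy_words and toy_nrc are hypothetical stand-ins, and the relationship argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Toy tokens: one row per word occurrence (hypothetical data)
toy_words <- tibble(id = c(1, 1, 2), word = c("bad", "food", "bad"))

# Toy lexicon mimicking NRC: one word can carry several sentiments
toy_nrc <- tibble(
  word      = c("bad", "bad", "food", "food", "food"),
  sentiment = c("anger", "negative", "joy", "positive", "trust")
)

# Each word occurrence is repeated once per sentiment it is tagged with
toy_joined <- inner_join(toy_words, toy_nrc, by = "word",
                         relationship = "many-to-many")

n_tokens <- nrow(toy_words)   # 3 word occurrences
n_joined <- nrow(toy_joined)  # 7 rows: 2 + 3 + 2
```

Because each occurrence is repeated once per sentiment, counts taken after this join are counts of word-sentiment pairs rather than counts of words.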
Question 1, part (c)(i)

Q. Find the most common words associated with each sentiment.

# mcd_r_nrc_sentiments already contains the NRC sentiment column,
# so no second join against the lexicon is needed
nrc_most_common <- mcd_r_nrc_sentiments %>%
  count(word, sentiment, sort = TRUE) %>%
  top_n(10)

knitr::kable(nrc_most_common, "pipe", col.names = c("Word", "Sentiment", "Count"), align = c("l", "c", "c"), caption = "Most common word associated with each sentiment")
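Most common word associated with each sentiment

| Word     | Sentiment    | Count |
|:---------|:------------:|:-----:|
| food     | joy          | 866   |
| food     | positive     | 866   |
| food     | trust        | 866   |
| time     | anticipation | 522   |
| customer | positive     | 186   |
| bad      | anger        | 185   |
| bad      | disgust      | 185   |
| bad      | fear         | 185   |
| bad      | negative     | 185   |
| bad      | sadness      | 185   |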

Food and customer are the top words tagged as positive (food is also tagged as joy and trust). Time is tagged as anticipation, which may reflect customers waiting for their orders, and bad is the top word across the negative sentiments (anger, disgust, fear, negative and sadness).
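The table above ranks word-sentiment pairs overall, so only a few sentiments appear in it. One possible way to get the top word within every sentiment is to group by sentiment before slicing. This is a sketch on made-up data; toy_nrc_words stands in for mcd_r_nrc_sentiments and its values are hypothetical.

```r
library(dplyr)

# Toy stand-in for mcd_r_nrc_sentiments (hypothetical rows)
toy_nrc_words <- tibble(
  word      = c("food", "food", "food", "happy",
                "bad", "bad", "bad", "rude",
                "time", "time"),
  sentiment = c("joy", "joy", "joy", "joy",
                "anger", "anger", "anger", "anger",
                "anticipation", "anticipation")
)

# Count each word within each sentiment, then keep the top word per sentiment
top_word_per_sentiment <- toy_nrc_words %>%
  count(sentiment, word, sort = TRUE) %>%
  group_by(sentiment) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Raising the n argument of slice_max() (for example to 5) would give the top five words per sentiment instead of one.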

Question 1, part (c)(ii)

Q. How frequently do each of the 10 sentiments appear in the reviews dataset?

count_nrc_sentiments <- count(mcd_r_nrc_sentiments, sentiment, sort = TRUE)

ggplot(count_nrc_sentiments) +
  geom_col(mapping = aes(x = sentiment, y = n)) +
  ylab("Count of Sentiments") + 
  ggtitle("Frequency of All NRC Sentiments") +
  xlab("Sentiment") +
  theme_minimal() +
  theme(panel.grid.major = element_line(color = "white"),
        axis.title.x = element_text(colour = "black", face = "bold"),
        axis.title.y = element_text(colour = "black", face = "bold"),
        axis.text = element_text(colour = "#666666"),
        axis.ticks = element_line(colour = "#666666"),
        axis.line = element_line(colour = "#666666"),
        title = element_text(colour = "black", face = "bold"),
        plot.title = element_text(face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5)) 

The most frequent NRC sentiment is positive and the second most frequent is negative; the least frequent is surprise. Trust and anticipation are also notable, as the third and fourth most frequent sentiments. The high trust count could mean that the branch is seen as trustworthy, but it is driven at least in part by the word food, which NRC tags as trust (as well as joy and positive). Anticipation is high because time falls under anticipation. If many reviews mention time, this could indicate that service at this branch is slow and that there are long waiting times.

Question 1, part (d)

Q. Find the top 20 most frequently occurring bigrams, making sure to remove bigrams that contain a stop word.

mcd_r_bigrams <- mcd_r %>% 
  unnest_tokens(bigram, review, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ")

bigram_counts <- count(mcd_r_bigrams, bigram, sort = TRUE) %>%
  top_n(20)

knitr::kable(bigram_counts, "pipe", col.names = c("Bigram", "Count"), align = c("l", "c"), caption = "20 Most Frequently Occurring Bigrams")
20 Most Frequently Occurring Bigrams

| Bigram           | Count |
|:-----------------|:-----:|
| fast food        | 153   |
| customer service | 116   |
| ice cream        | 61    |
| worst mcdonalds  | 52    |
| 10 minutes       | 49    |
| parking lot      | 43    |
| worst mcdonald’s | 42    |
| 15 minutes       | 39    |
| chicken nuggets  | 38    |
| french fries     | 34    |
| mickey d’s       | 33    |
| 20 minutes       | 32    |
| 5 minutes        | 29    |
| iced coffee      | 29    |
| dollar menu      | 28    |
| late night       | 28    |
| sweet tea        | 27    |
| 24 hours         | 25    |
| chicken sandwich | 23    |
| quarter pounder  | 23    |

‘Worst McDonalds’ appears twice in this list under two spellings (worst mcdonalds and worst mcdonald’s, 94 occurrences combined), indicating that many customers are highly dissatisfied with the branch. Customer service and ice cream also appear frequently. Time frames such as 10 minutes, 15 minutes and 20 minutes appear often as well, and might refer to waiting times.
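Spelling variants such as mcdonalds and mcdonald’s are counted separately because the tokeniser keeps apostrophes. A minimal sketch of one way to merge them, by stripping straight and curly apostrophes before counting, is given below; the example vector is made up and this step is not part of the analysis above.

```r
# Toy bigrams with three spellings of the same phrase (hypothetical data);
# \u2019 is the curly apostrophe
bigrams <- c("worst mcdonalds", "worst mcdonald's",
             "worst mcdonald\u2019s", "ice cream")

# Drop straight and curly apostrophes so spelling variants collapse together
normalised <- gsub("['\u2019]", "", bigrams)

# Count the normalised bigrams
variant_counts <- table(normalised)
```

In the real pipeline the same substitution could be applied to the review column before unnest_tokens(), so that both spellings contribute to a single row in the table.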

Question 1, part (e)

Q. Find the top 20 most frequently occurring trigrams, making sure to remove trigrams that contain a stop word.

mcd_r_trigrams <- mcd_r %>% 
  unnest_tokens(trigram, review, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!word3 %in% stop_words$word) %>%
  unite(trigram, word1, word2, word3, sep = " ") 

trigram_counts <- count(mcd_r_trigrams, trigram, sort = TRUE) %>%
  top_n(20)

knitr::kable(trigram_counts, "pipe", col.names = c("Trigram", "Count"), align = c("l", "c"), caption = "20 Most Frequently Occurring Trigrams")
20 Most Frequently Occurring Trigrams

| Trigram                   | Count |
|:--------------------------|:-----:|
| ice cream machine         | 10    |
| worst customer service    | 10    |
| 24 hour drive             | 9     |
| eat fast food             | 8     |
| fast food restaurants     | 8     |
| ice cream cone            | 8     |
| 10 piece chicken          | 7     |
| fast food restaurant      | 7     |
| sausage egg mcmuffin      | 7     |
| terrible customer service | 7     |
| free wi fi                | 6     |
| ice cream cones           | 6     |
| piece chicken nugget      | 5     |
| piece chicken nuggets     | 5     |
| worst fast food           | 5     |
| 2 apple pies              | 4     |
| 5 10 minutes              | 4     |
| bad customer service      | 4     |
| double cheese burger      | 4     |
| fast food chain           | 4     |
| fast food joint           | 4     |
| fast food joints          | 4     |
| sausage egg biscuit       | 4     |
| spicy mcchicken sandwich  | 4     |
| wait 10 minutes           | 4     |
| waited 15 minutes         | 4     |

Ice cream machine, worst customer service and 24 hour drive are the most frequently occurring trigrams. These indicate that there might be issues with the ice cream machine and with the quality of the customer service. 24 hour drive probably refers to the 24-hour drive-thru. The table contains 26 rows rather than 20 because top_n() keeps ties.

Question 1, part (f)(i)

Q. Find all reviews that contain the word “waiting” and export these reviews to a .csv file (make sure your results are case-insensitive). Read 10 of the reviews (or all the reviews if less than 10 are returned) and summarise the context in which reviewers are referring to “waiting”.

From the above analysis, waiting is a frequently mentioned theme. After filtering for all reviews that mention waiting and reading the first 10 of them, the context of waiting in the reviews is as follows:

This McDonald's branch has long waiting times both inside the building and in the drive-thru. The reviews also mention slow staff, with some reviewers calling the staff lazy. Lack of management, unprofessionalism, lost orders and cold food are also mentioned.
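The filtering and export code is not shown in this report. A minimal sketch of one way to do it is given below, assuming a tibble with id and review columns like mcd_r; the toy reviews, the object names and the output path are hypothetical.

```r
library(dplyr)
library(stringr)
library(readr)

# Toy stand-in for mcd_r (hypothetical reviews)
toy_reviews <- tibble(
  id     = 1:4,
  review = c("Waiting 20 minutes in the drive thru.",
             "Food was hot and fresh.",
             "Still WAITING for my order!",
             "No wait at all today.")
)

# Case-insensitive match on the search term
waiting_reviews <- toy_reviews %>%
  filter(str_detect(review, regex("waiting", ignore_case = TRUE)))

# Export the matching reviews to a .csv file (written to a temporary folder here)
out_file <- file.path(tempdir(), "waiting_reviews.csv")
write_csv(waiting_reviews, out_file)
```

The same pattern would apply to the other search terms by swapping "waiting" for "shamrock shake" or "ice cream machine".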

Question 1, part (f)(ii)

Q. Find all reviews that contain the word “shamrock shake” and export these reviews to a .csv file (make sure your results are case-insensitive). Read 10 (or all the reviews if less than 10 are returned) of the reviews and summarise the context in which reviewers are referring to “shamrock shake”.

After filtering for all reviews that mention the seasonal Shamrock Shake and reading the first 10 of them, the context of shamrock shakes in the reviews is as follows:

There are mixed reviews of the shamrock shake: certain customers refer to it as decent and enjoyable, whilst others comment on the ‘obviously fake flavouring’ and the weak flavour, and mention that the drink is badly mixed and tends to be chalky.

Question 1, part (f)(iii)

Q. Find all reviews that contain the word “ice cream machine” and export these reviews to a .csv file (make sure your results are case-insensitive). Read 10 (or all the reviews if less than 10 are returned) of the reviews and summarise the context in which reviewers are referring to “ice cream machine”.

From the above analysis, the ice cream machine is a frequently mentioned topic. After filtering for all reviews that mention the ice cream machine and reading the first 10 of them, the context of the ice cream machine in the reviews is as follows:

The reviews mention that the ice cream machine is often either broken down or locked, especially at night. It is also mentioned that the staff are rude about the ice cream machine and that they simply cannot be bothered to fill it, which is why it is often unavailable. A wrong ice cream order is also mentioned.

Question 1, part (g)

Q. Create two coloured word clouds, one showing the most common non-stopwords that are classified as “positive” and another showing the most common non-stopwords that are classified as “negative”. Include only those words that appear at least 50 times in the “positive” word cloud and at least 50 times in the “negative” word cloud.

count_mcd_r_non_stop_words<- mcd_r %>% 
  unnest_tokens(word, review, token = "words") %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)

count_mcd_r_bing_sentiments <- inner_join(count_mcd_r_non_stop_words, bing_sentiments)

Positive Wordcloud

#positive wordcloud#
mcd_pos_sentiments <- filter(count_mcd_r_bing_sentiments, sentiment == "positive")

wordcloud(mcd_pos_sentiments$word, 
          mcd_pos_sentiments$n, 
          min.freq = 50, 
          colors = brewer.pal(8, "Reds"))

Negative Wordcloud

#negative wordcloud# 
mcd_neg_sentiments <- filter(count_mcd_r_bing_sentiments, sentiment == "negative")

wordcloud(mcd_neg_sentiments$word, 
          mcd_neg_sentiments$n, 
          min.freq = 50, 
          colors = brewer.pal(8, "Blues"))

Question 2

In this section, a GameStop reviews dataset will be analysed with an LDA topic model fitted using the collapsed Gibbs sampling method.

Setting up for analysis:

library(topicmodels)
library(topicdoc)
library(reshape2)

Question 2, part (a)

Q. Correctly prepare the data by removing stop words, counting the number of times each word appears in each review and converting to a Document Term Matrix.

gmstp <- read_csv("gamestop_product_reviews.csv")

my_stop_words <- bind_rows(stop_words, 
                           tibble(word = c("im", "ive", "id", "theyve", "theyre", "dont")))

gmstp_no_stop_words <- gmstp %>%
  unnest_tokens(word, review, token = "words") %>%
  anti_join(my_stop_words)

gmstp_word_counts <- count(gmstp_no_stop_words, id, word, sort = TRUE)

gmstp_dtm <- cast_dtm(gmstp_word_counts, document = id, term = word, value = n)

Question 2, part (b)

Q. When using the LDA() function:

  1. Use the Collapsed Gibb’s Sampling method.
  2. Set the random seed at 1234.
  3. You can choose whatever value of k that you think leads to interesting and actionable topics (note that not all topics need to be interesting and actionable), although you should use a value of k ≥ 10. Finding a suitable value of k may require some trial and error, which means you may end up creating several different LDA models with different values of k, but only the final LDA model should be included in this report.
gmstp_lda <- LDA(gmstp_dtm, method = "Gibbs", k = 12, control = list(seed = 1234))

gmstp_lda_beta <- tidy(gmstp_lda)

gmstp_lda_top_terms <- gmstp_lda_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 12, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, -beta)

Question 2, part (c)(i)

Q. Assess the quality of the topics produced by the LDA algorithm using the following methods:

i. Visually: create barcharts showing the top 10 terms for each topic and write a summary of what you think each topic is focussing on.

I found out that 12 topics gave the most actionable results.

gmstp_lda_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  labs(title = "Top 12 terms in each LDA topic", x = expression(beta), y = NULL) +
  facet_wrap(~ topic, ncol = 6, scales = "free")

  • Topic 1 includes words such as TV, picture, headset and sound, suggesting a topic about the quality of headphones connected to a TV to play video games on the big screen. The words quality, comfortable and features suggest a good product experience when using the headphones.

  • Topics 2, 4, 7, 11 and 12 are all topics which refer to different games such as Zelda, Fallout 3 and 4 and Pokemon Black and White, whereas other topics mention the gaming experience without mentioning any particular game titles.

  • Words such as kids, son, nice, cute, loves and easy indicate that topics 3 and 10 focus on purchases for children, which were positively received.

  • Topic 5 focuses on Xbox and Nintendo Switch gaming. The words awesome, fan, recommend, love and design indicate a good experience with these products.

  • Topic 6 focuses on a Samsung computer monitor. Words such as amazing, gaming, perfect, quality, movies and streaming indicate that the screen is very good for multiple purposes, such as playing games or watching videos. This indicates that the monitor's quality is high.

  • Topic 8 includes words such as batteries, energizer, lasting and recommend. This indicates a positive experience with energizer alkaline batteries for use in a flashlight.

  • Topic 9 includes many numbers and words such as worth, bad, honestly and money, indicating a negative experience with an expensive product. However, the product is not named, which makes this topic not particularly actionable.

Question 2, part (c)(ii)

Q. Assess the quality of the topics produced by the LDA algorithm using the following methods:

  ii. Numerically: calculate the topic size, mean token length, topic coherence and topic exclusivity. Comment on your findings. For example, which topics appear to have the highest quality? Lowest quality?
topic_quality <- topic_diagnostics(gmstp_lda, gmstp_dtm)

Based on the numerical topic quality analysis:

  • Topic 8 has the coherence score closest to 0 making it the most coherent topic.

  • Topic 3 has the score farthest away from 0, making it the least coherent topic.

  • Topic 12 has the biggest topic size, whilst topic 8 has the smallest topic size.

  • All topic exclusivity scores are very similar, which suggests that the topics are not strongly distinct from one another. There is a lot of overlap in the words found in each topic: game, games and gameplay appear in multiple topics, making the topics harder to distinguish and define.
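The comparisons in the list above can be read off the diagnostics table programmatically. The sketch below uses a small made-up data frame: the numbers are hypothetical, and the column names are assumed to mirror those returned by topicdoc::topic_diagnostics(). It also assumes, as in the analysis above, that coherence values closer to 0 are better.

```r
# Hypothetical diagnostics for three topics; column names are assumed to
# mirror those returned by topicdoc::topic_diagnostics()
toy_quality <- data.frame(
  topic_num         = 1:3,
  topic_size        = c(410.2, 155.7, 298.4),
  topic_coherence   = c(-120.5, -85.3, -160.9),
  topic_exclusivity = c(9.6, 9.8, 9.5)
)

# Coherence scores here are negative; the value closest to 0 is the maximum
most_coherent  <- toy_quality$topic_num[which.max(toy_quality$topic_coherence)]
least_coherent <- toy_quality$topic_num[which.min(toy_quality$topic_coherence)]

# Largest topic by size
largest_topic  <- toy_quality$topic_num[which.max(toy_quality$topic_size)]

# Spread of exclusivity: a small range means the topics are hard to tell apart
exclusivity_range <- diff(range(toy_quality$topic_exclusivity))
```

Run against the real topic_quality object, the same lookups would identify the topics named in the list above without reading the table by eye.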

Question 2, part (d)

Q. Based on your analysis, suggest some actions that could be taken by GameStop to improve their business.

  • Dig deeper to find out which product(s) are being discussed in topic 9, and remove the product from the catalog if it is not meeting customer expectations.

Question 3

For this section, I will analyse the YouTube Comments under ‘Mrwhosetheboss’ YouTube Review and unboxing of the newest iPhone 16 and 16 Pro.

Full Video Title: iPhone 16 / 16 Pro Unboxing - Testing every new feature!

Creator: Mrwhosetheboss

Youtube Video Link

First I will set up the download of the YouTube comments from this video using the vosonSML package and an API key.

20 Most Frequent Words under this YouTube Video

iphone_counts <- iphone_reviews %>% 
  unnest_tokens(word, comments, token = "words") %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  top_n(20)
ggplot(data = iphone_counts) +
  geom_col(mapping = aes(x = n, y = reorder(word, n))) +
  labs(y = NULL) +
  ggtitle("Top 20 Most Frequently Occurring Words") +
  xlab("Count") +
  theme_minimal() +
  theme(panel.grid.major = element_line(color = "white"),
        axis.title.x = element_text(colour = "black", face = "bold"),
        axis.text = element_text(colour = "#666666"),
        axis.ticks = element_line(colour = "#666666"),
        axis.line = element_line(colour = "#666666"),
        title = element_text(colour = "black", face = "bold"),
        plot.title = element_text(face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5)) 

The most frequently occurring words are iPhone, Apple, 16, pro and phone. Other frequently mentioned words are video, camera and battery, which suggests that these features of the newest iPhone are common topics of discussion.

Bing Dictionary Analysis

I will test commenter sentiments using the Bing dictionary to get a better idea of whether the comments are positive or negative surrounding the newest iPhone launch.

bing_sentiments <- get_sentiments("bing")

iphone_reviews_non_stop_words<- iphone_reviews %>% 
  unnest_tokens(word, comments, token = "words") %>%
  anti_join(stop_words)

iphone_reviews_bing_sentiments <- inner_join(iphone_reviews_non_stop_words, bing_sentiments)

count_iphone_reviews_bing_sentiments <- iphone_reviews_bing_sentiments %>% 
  anti_join(stop_words) %>%
  count(sentiment, sort = TRUE)

knitr::kable(count_iphone_reviews_bing_sentiments, "pipe", col.names = c("Sentiment", "Count"), align = c("l", "c"), caption = "Count of Positive and Negative Sentiments")
Count of Positive and Negative Sentiments

| Sentiment | Count |
|:----------|:-----:|
| positive  | 1997  |
| negative  | 1931  |

As can be seen, the commenters' sentiments are split approximately 50/50, with positive sentiment words slightly ahead (1,997 against 1,931, about 51%). This means that the launch was received positively overall, but only by a slim margin. The negative sentiment may stem from apprehension about the newest features and whether they are worth Apple’s premium price point.
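The size of this margin can be put in context with a quick calculation on the two counts from the table. The sketch below computes the positive share and runs an exact binomial test against an even split; treating each sentiment word as an independent draw is a simplifying assumption, so the result is only indicative.

```r
# Counts taken from the Bing sentiment table above
positive <- 1997
negative <- 1931
total    <- positive + negative

# Share of matched sentiment words that are positive
share_positive <- positive / total

# Exact binomial test against an even split; treating words as independent
# draws is a simplifying assumption
even_split_test <- binom.test(positive, total, p = 0.5)
```

The positive share comes out at roughly 0.51, and the test does not reject an even split at the 5% level, which is consistent with describing the margin as slim.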

I will look into this further to see which most frequently used words fit under positive and negative sentiments.

positive_iphone <- iphone_reviews_bing_sentiments %>%
  filter(sentiment == "positive") %>%
  count(word, sort = TRUE) %>%
  top_n(10)


negative_iphone <- iphone_reviews_bing_sentiments %>%
  filter(sentiment == "negative") %>%
  count(word, sort = TRUE) %>%
  top_n(10)
#positive wordcloud#

wordcloud(positive_iphone$word, 
          positive_iphone$n, 
          min.freq = 10, 
          colors = brewer.pal(8, "Reds"))

Love, intelligence, worth, nice and innovation are the most frequently used words connected to positive sentiments.

#negative wordcloud# 

wordcloud(negative_iphone$word, 
          negative_iphone$n, 
          min.freq = 10, 
          colors = brewer.pal(8, "Blues"))

The words boring, crazy, expensive, dumb and bad are the most frequently used words connected to negative sentiments.

NRC Dictionary Analysis

As the NRC dictionary contains 10 different sentiments, a sentiment analysis with it will give a more precise picture of customer sentiment surrounding this launch.

nrc_sentiments <- get_sentiments("nrc")


iphone_reviews_non_stop_words <- iphone_reviews %>% 
  unnest_tokens(word, comments, token = "words") %>%
  anti_join(stop_words)

# A word can carry several NRC sentiments, so a many-to-many join is expected here
iphone_reviews_nrc_sentiments <- inner_join(iphone_reviews_non_stop_words, nrc_sentiments,
                                            relationship = "many-to-many")
count_iphone_reviews_nrc_sentiments <- iphone_reviews_nrc_sentiments %>% 
  anti_join(stop_words) %>%
  count(sentiment, sort = TRUE)

knitr::kable(count_iphone_reviews_nrc_sentiments, "pipe", col.names = c("Sentiment", "Count"), align = c("l", "c"), caption = "Count of Sentiments")
Count of Sentiments

| Sentiment    | Count |
|:-------------|:-----:|
| positive     | 2990  |
| negative     | 1985  |
| trust        | 1462  |
| anticipation | 1388  |
| joy          | 1068  |
| anger        | 1037  |
| fear         | 920   |
| sadness      | 817   |
| disgust      | 608   |
| surprise     | 565   |

Positive is the most frequent sentiment, with trust, anticipation and joy being the positive emotions with the highest counts.

This indicates that Apple customers are excited about the launch, and there is a sense of anticipation which perhaps relates to upgrading their current phone to the newest iPhone 16 or iPhone 16 Pro. The sentiment of trust relates to customers' trust in the Apple brand and its products, which is a big positive for Apple.

The most frequent negative emotion towards the newest iPhone launch is anger, which indicates that commenters with negative opinions of this launch are angry about something. I will filter for the anger sentiment to find the most frequently used words associated with it.

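iphone_reviews_neg_sentiments <- filter(iphone_reviews_nrc_sentiments, sentiment == "anger")

# Count words in the anger-only tibble rather than the full NRC-joined tibble
top_angry_sentiments <- count(iphone_reviews_neg_sentiments, word, sort = TRUE) %>%
  top_n(15)

knitr::kable(top_angry_sentiments, "pipe", col.names = c("Word", "Count"), align = c("l", "c"), caption = "Count of Most Frequent Words Relating to Anger")
Count of Most Frequent Words Relating to Anger

| Word         | Count |
|:-------------|:-----:|
| money        | 570   |
| battery      | 326   |
| intelligence | 324   |
| finally      | 276   |
| love         | 242   |
| wait         | 232   |
| bad          | 220   |
| crazy        | 212   |
| hate         | 175   |
| hope         | 165   |
| feature      | 162   |
| time         | 162   |
| honest       | 161   |
| pretty       | 148   |
| excited      | 135   |

The counts shown in this table were taken from the full NRC-joined tibble rather than the anger-only one, which is why words such as love, hope and excited appear and why each word is counted once for every sentiment it is tagged with. With that caveat, the most frequent words used under the anger sentiment are money and battery, suggesting that Apple customers are angry about the price point of the newest phones as well as the longevity of the phone's battery.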
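Counting words for a single NRC sentiment depends on filtering before counting. The sketch below shows the difference on a toy tibble; toy_nrc_tokens and its rows are hypothetical and only mimic the shape of iphone_reviews_nrc_sentiments.

```r
library(dplyr)

# Toy NRC-joined tokens (hypothetical); "money" carries several sentiments
toy_nrc_tokens <- tibble(
  word      = c("money", "money", "money", "love", "love", "hate", "hate"),
  sentiment = c("anger", "joy", "trust", "joy", "positive", "anger", "negative")
)

# Counting without filtering mixes all sentiments together and counts a word
# once per sentiment it is tagged with
all_counts <- count(toy_nrc_tokens, word, sort = TRUE)

# Filtering first keeps only the rows tagged as anger
anger_counts <- toy_nrc_tokens %>%
  filter(sentiment == "anger") %>%
  count(word, sort = TRUE)
```

In the unfiltered counts "love" appears and "money" is counted three times, whereas the anger-only counts contain just "hate" and "money", each counted once.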