Introduction

Golf is a complex and difficult sport with many components. Clubs, balls, tees, and other equipment all play into how you shoot (or, more importantly, your confidence level at the tee box). One aspect of the game often claimed to be important is the ball. At the top end is the Titleist ProV1, a ball that is used by most tour players and can cost upward of $4 each. On the other hand, there is the Kirkland Signature ball, which is made by Costco and is not used by any tour players. The Kirkland Signature ball has been around for a while, but it has exploded in popularity in the golf scene over the past 6 months. I will look at a variety of Amazon reviews for both products to determine what people think about these two brands. Through sentiment analysis, I will attempt to answer the following questions:

What positive characteristics do most people associate with their golf ball of choice?

What general emotions do users of the ProV1 and Kirkland Signature balls feel when using their ball on the course?

How has the sentiment of customer reviews for Kirkland and ProV1 golf balls changed over time?

Loading In Necessary Packages/Data

I have collected 90 Amazon reviews, with dates and review content, for both the ProV1 and the Kirkland Signature ball. They can be downloaded from the links in the read.csv commands below.

library(tm)

## Warning: package 'tm' was built under R version 4.3.3

## Loading required package: NLP

library(slam)
library(textclean)

## Warning: package 'textclean' was built under R version 4.3.3

library(tidytext)

## Warning: package 'tidytext' was built under R version 4.3.3

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

prov1_reviews <-
  read.csv("https://myxavier-my.sharepoint.com/:x:/g/personal/estepa1_xavier_edu/EdlGIGWYCj9It03gQejlGuEBck6e_qtoYGWpgfmCFDhaPw?download=1")
kirkland_reviews <-
  read.csv("https://myxavier-my.sharepoint.com/:x:/g/personal/estepa1_xavier_edu/EQJZf2oShpJAv277EWIH-oYB75EcJDa215a4dcgfOZLZaw?download=1")
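
Before running the analysis, it can help to confirm that each file has the two columns used throughout this report (review_dates and review_contents) and that the dates parse. The following is a minimal sketch: the helper name check_reviews is hypothetical, the toy data frame stands in for the downloaded CSVs, and the ISO date format (YYYY-MM-DD) is an assumption about the files.

```r
# Hypothetical helper: check that a reviews data frame has the two columns
# the analysis relies on, and count rows that would cause trouble later.
check_reviews <- function(df) {
  required <- c("review_dates", "review_contents")
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  # Assumes ISO dates (YYYY-MM-DD); anything else becomes NA here
  parsed <- as.Date(df$review_dates, format = "%Y-%m-%d")
  list(
    n_reviews = nrow(df),
    n_bad_dates = sum(is.na(parsed)),
    n_empty_reviews = sum(is.na(df$review_contents) |
                            trimws(df$review_contents) == "")
  )
}

# Toy data standing in for the downloaded CSVs (invented for illustration)
toy_reviews <- data.frame(
  review_dates = c("2024-01-15", "2023-10-02", "not a date"),
  review_contents = c("Great ball, great gift", "", "Worth the price"),
  stringsAsFactors = FALSE
)

check_result <- check_reviews(toy_reviews)
print(check_result)
```

In the report itself, the same helper could be called on prov1_reviews and kirkland_reviews right after they are loaded.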

Question 1

First, I will look at the most common positive words in the reviews, excluding stop words.

nrc <- get_sentiments("nrc") %>% filter(sentiment == "positive")

corpus <- Corpus(VectorSource(prov1_reviews$review_contents))

# Preprocess the text
corpus <- tm_map(corpus, content_transformer(tolower))

## Warning in tm_map.SimpleCorpus(corpus, content_transformer(tolower)):
## transformation drops documents

corpus <- tm_map(corpus, removePunctuation)

## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
## documents

corpus <- tm_map(corpus, removeNumbers)

## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
## documents

corpus <- tm_map(corpus, removeWords, stopwords("en"))

## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("en")):
## transformation drops documents

# Create a term-document matrix
tdm <- TermDocumentMatrix(corpus)

word_counts <- row_sums(as.matrix(tdm), na.rm = TRUE)
word_counts <- sort(word_counts, decreasing = TRUE)

# Convert to dataframe
words_df <- data.frame(word = names(word_counts), count = word_counts, stringsAsFactors = FALSE)

# Filter for positive words
positive_words <- words_df %>% 
  semi_join(nrc, by = "word") %>%
  top_n(10, count)

ggplot(positive_words, aes(x = reorder(word, count), y = count)) +
  geom_bar(stat = "identity", fill = "coral") +
  labs(title = "Top 10 Positive Words in ProV1 Reviews", x = "Word", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

corpus <- Corpus(VectorSource(kirkland_reviews$review_contents))

# Preprocess the text
corpus <- tm_map(corpus, content_transformer(tolower))

## Warning in tm_map.SimpleCorpus(corpus, content_transformer(tolower)):
## transformation drops documents

corpus <- tm_map(corpus, removePunctuation)

## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
## documents

corpus <- tm_map(corpus, removeNumbers)

## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
## documents

corpus <- tm_map(corpus, removeWords, stopwords("en"))

## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("en")):
## transformation drops documents

# Create a term-document matrix
tdm <- TermDocumentMatrix(corpus)

word_counts <- row_sums(as.matrix(tdm), na.rm = TRUE)
word_counts <- sort(word_counts, decreasing = TRUE)

# Convert to dataframe
words_df <- data.frame(word = names(word_counts), count = word_counts, stringsAsFactors = FALSE)

# Filter for positive words
positive_words <- words_df %>% 
  semi_join(nrc, by = "word") %>%
  top_n(10, count)  # Get top 10 positive words

ggplot(positive_words, aes(x = reorder(word, count), y = count)) +
  geom_bar(stat = "identity", fill = "lightgreen") +
  labs(title = "Top 10 Positive Words in Kirkland Reviews", x = "Word", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

As we can see, consumers associate different positive words with each golf ball. In the ProV1 reviews, the word “gift” appears much more frequently. This could be because ProV1s are widely regarded as a premium ball and are expensive, so they are often given as gifts. The Kirkland Signature reviews, on the other hand, contain many more words associated with value, such as “worth,” “deal,” and “expect,” along with many more uses of “good.” This is likely because they cost around a third of the price and give consumers more bang for their buck.
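
The two code blocks above repeat the same cleaning and counting steps for each ball. As a rough sketch, those steps could be wrapped in one function and called twice. The version below uses base R rather than the tm pipeline, so its tokenization will not match tm exactly. The function name top_lexicon_words, the toy lexicon, the toy stop-word list, and the toy reviews are all assumptions made for illustration; in the report, the lexicon would be the positive words from get_sentiments("nrc").

```r
# Hypothetical helper: count how often lexicon words appear in a set of reviews.
# `lexicon_words` and `stop_words` are plain character vectors.
top_lexicon_words <- function(text, lexicon_words, stop_words = character(), n = 10) {
  cleaned <- tolower(text)
  # Replace punctuation and digits with spaces (tm drops them instead)
  cleaned <- gsub("[[:punct:][:digit:]]+", " ", cleaned)
  tokens <- unlist(strsplit(cleaned, "\\s+"))
  tokens <- tokens[tokens != "" & !(tokens %in% stop_words)]
  tokens <- tokens[tokens %in% lexicon_words]
  counts <- sort(table(tokens), decreasing = TRUE)
  out <- data.frame(
    word = as.character(names(counts)),
    count = as.integer(counts),
    stringsAsFactors = FALSE
  )
  head(out, n)
}

# Toy inputs, invented for illustration only
toy_lexicon <- c("great", "gift", "worth", "deal", "good")
toy_stops <- c("a", "for", "my", "the", "as", "and", "it")
prov1_toy <- c("Great gift for my dad!",
               "A great ball, bought 12 as a gift.",
               "Great feel.")
kirkland_toy <- c("Good deal, worth it.",
                  "Good ball for the price",
                  "Good value and a good deal")

prov1_top <- top_lexicon_words(prov1_toy, toy_lexicon, toy_stops)
kirkland_top <- top_lexicon_words(kirkland_toy, toy_lexicon, toy_stops)
print(prov1_top)
print(kirkland_top)
```

Writing the steps once means any change to the cleaning rules applies to both balls, which keeps the two charts comparable.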

Question 2

Now, I will conduct a sentiment analysis for both balls. This will pool words into general categories so we can see the broader emotions consumers feel about their ball of choice.

nrc <- get_sentiments("nrc")

corpus <- Corpus(VectorSource(prov1_reviews$review_contents))

# Preprocess the text
corpus <- tm_map(corpus, content_transformer(tolower))

## Warning in tm_map.SimpleCorpus(corpus, content_transformer(tolower)):
## transformation drops documents

corpus <- tm_map(corpus, removePunctuation)

## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
## documents

corpus <- tm_map(corpus, removeNumbers)

## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
## documents

corpus <- tm_map(corpus, removeWords, stopwords("en"))

## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("en")):
## transformation drops documents

# Create a term-document matrix
tdm <- TermDocumentMatrix(corpus)

# Convert to matrix and get term frequency
word_counts <- row_sums(as.matrix(tdm), na.rm = TRUE)
words_df <- data.frame(word = names(word_counts), count = word_counts, stringsAsFactors = FALSE)

# Join with NRC lexicon
emotion_counts <- words_df %>%
  inner_join(nrc, by = "word") %>%
  group_by(sentiment) %>%
  summarise(total_count = sum(count))

ggplot(emotion_counts, aes(x = sentiment, y = total_count, fill = sentiment)) +
  geom_bar(stat = "identity") +
  labs(title = "Emotional Sentiment Distribution in ProV1 Reviews", x = "Emotion", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette = "Set3")

corpus <- Corpus(VectorSource(kirkland_reviews$review_contents))

# Preprocess the text
corpus <- tm_map(corpus, content_transformer(tolower))

## Warning in tm_map.SimpleCorpus(corpus, content_transformer(tolower)):
## transformation drops documents

corpus <- tm_map(corpus, removePunctuation)

## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
## documents

corpus <- tm_map(corpus, removeNumbers)

## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
## documents

corpus <- tm_map(corpus, removeWords, stopwords("en"))

## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("en")):
## transformation drops documents

# Create a term-document matrix
tdm <- TermDocumentMatrix(corpus)

# Convert to matrix and get term frequency
word_counts <- row_sums(as.matrix(tdm), na.rm = TRUE)
words_df <- data.frame(word = names(word_counts), count = word_counts, stringsAsFactors = FALSE)

# Join with NRC lexicon
emotion_counts <- words_df %>%
  inner_join(nrc, by = "word") %>%
  group_by(sentiment) %>%
  summarise(total_count = sum(count))

ggplot(emotion_counts, aes(x = sentiment, y = total_count, fill = sentiment)) +
  geom_bar(stat = "identity") +
  labs(title = "Emotional Sentiment Distribution in Kirkland Reviews", x = "Emotion", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette = "Set3")

The emotional sentiment charts for the two balls are strikingly similar. However, there are a few key differences that could point to a common theme. First and foremost, surprise words appear noticeably more often in the Kirkland reviews. This could be because the ball provides premium performance despite not being a name brand. On the other hand, there is more negative sentiment toward the Kirkland ball. This could reflect some users having a poor experience, or they could just be taking out their poor golf game on an Amazon review. Finally, there are many more joy words in the ProV1 reviews. This could come from the gift-giving element we examined earlier.
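
Because the two sets of reviews may contain different numbers of words, raw counts can be hard to compare across the two charts. One option is to convert each ball's emotion counts to a share of its own total. The sketch below shows the arithmetic on invented numbers; the vectors prov1_emotions and kirkland_emotions are stand-ins for the two emotion_counts tables and do not reflect the actual results.

```r
# Toy emotion totals standing in for the two `emotion_counts` tables;
# the numbers are invented for illustration only.
prov1_emotions <- c(joy = 40, surprise = 10, negative = 20, positive = 130)
kirkland_emotions <- c(joy = 15, surprise = 12, negative = 18, positive = 55)

# Share of each emotion within one ball's reviews, as a percentage
emotion_share <- function(counts) round(100 * counts / sum(counts), 1)

comparison <- data.frame(
  emotion = names(prov1_emotions),
  prov1_pct = as.numeric(emotion_share(prov1_emotions)),
  # Index by name so both columns list the emotions in the same order
  kirkland_pct = as.numeric(emotion_share(kirkland_emotions[names(prov1_emotions)])),
  stringsAsFactors = FALSE
)

# Positive values mean the emotion is more common in the Kirkland reviews
comparison$difference <- comparison$kirkland_pct - comparison$prov1_pct
print(comparison)
```

With shares instead of counts, a gap in the surprise or joy rows reflects a real difference in emphasis and not just a difference in how much text each ball's reviews contain.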

Question 3

For this question, we will look at the sentiment scores over time for both golf balls to see if there are any trends.

nrc <- get_sentiments("nrc") %>% filter(sentiment %in% c("positive", "negative"))

prov1_reviews$review_dates <- as.Date(prov1_reviews$review_dates)

start_date <- Sys.Date() - years(1)
filtered_reviews <- prov1_reviews %>% 
  filter(review_dates >= start_date) %>%
  unnest_tokens(word, review_contents)

# Count positive and negative word matches per quarter
sentiment_data <- filtered_reviews %>%
  inner_join(nrc, by = "word") %>%
  count(review_dates = floor_date(review_dates, "quarter"), sentiment,
        name = "sentiment_count")

## Warning in inner_join(., nrc, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 586 of `x` matches multiple rows in `y`.
## ℹ Row 2485 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

ggplot(sentiment_data, aes(x = review_dates, y = sentiment_count, fill = sentiment)) +
  geom_col(position = "dodge") +
  labs(title = "Quarterly Sentiment Trends for ProV1 Reviews",
       x = "Quarter",
       y = "Sentiment Count",
       fill = "Sentiment") +
  theme_minimal() +
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

kirkland_reviews$review_dates <- as.Date(kirkland_reviews$review_dates)

start_date <- Sys.Date() - years(1)
filtered_kirkland <- kirkland_reviews %>% 
  filter(review_dates >= start_date) %>%
  unnest_tokens(word, review_contents)

# Remove stop words
filtered_kirkland <- filtered_kirkland %>%
  anti_join(get_stopwords(), by = "word")

# Count positive and negative word matches per quarter
sentiment_data_kirkland <- filtered_kirkland %>%
  inner_join(nrc, by = "word") %>%
  count(review_dates = floor_date(review_dates, "quarter"), sentiment,
        name = "sentiment_count")

## Warning in inner_join(., nrc, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 192 of `x` matches multiple rows in `y`.
## ℹ Row 856 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

ggplot(sentiment_data_kirkland, aes(x = review_dates, y = sentiment_count, fill = sentiment)) +
  geom_col(position = "dodge") +
  labs(title = "Quarterly Sentiment Trends for Kirkland Reviews",
       x = "Quarter",
       y = "Sentiment Count",
       fill = "Sentiment") +
  theme_minimal() +
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

There are some interesting trends in the sentiment counts over time. In the ProV1 chart, positive sentiment is highest in January 2024. This could be because it is right after the Christmas season, when many people gave them as gifts. In the Kirkland chart, negative sentiment makes up a larger percentage than it does for the ProV1s. Kirkland balls tend to get the most reviews and positive sentiment in October, which is slightly unusual, since this is when golf season is coming to an end and we enter winter. Further analysis over a longer time period may be needed to see if this is a recurring trend.
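
The comparison of negative sentiment between the two balls is easier to check when it is computed directly as a share per quarter instead of read off the bars. The sketch below does this in base R on invented data; the matches data frame is a stand-in for the joined word-level table, and its dates and labels are assumptions for illustration only.

```r
# Toy per-word sentiment matches; dates and labels are invented for illustration.
matches <- data.frame(
  review_dates = as.Date(c("2023-10-05", "2023-10-20", "2023-11-11", "2023-12-01",
                           "2024-01-03", "2024-01-15", "2024-02-09", "2024-03-30")),
  sentiment = c("positive", "negative", "negative", "positive",
                "positive", "positive", "negative", "positive"),
  stringsAsFactors = FALSE
)

# Label each match with its calendar quarter, e.g. "2023 Q4"
matches$quarter <- paste(format(matches$review_dates, "%Y"),
                         quarters(matches$review_dates))

# Cross-tabulate quarter by sentiment, then compute the negative share
counts <- table(matches$quarter, matches$sentiment)
quarterly <- data.frame(
  quarter = rownames(counts),
  positive = as.integer(counts[, "positive"]),
  negative = as.integer(counts[, "negative"]),
  stringsAsFactors = FALSE
)
quarterly$negative_share <- quarterly$negative /
  (quarterly$positive + quarterly$negative)
print(quarterly)
```

Running the same calculation on both balls would give two negative_share columns that can be compared quarter by quarter, regardless of how many reviews each ball received.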

Assignment 7

Andrew Estep

2024-05-03
