Assignment 8

Text Analytics

Line of Inquiry and Rationale

I’m curious how reviews for Counter-Strike 2 and DOTA 2 compare. I plan to conduct a sentiment analysis to examine whether reviews for each game tend to contain more positive or negative words. I’m interested in exploring whether the tone of reviews differs based on day of the week, and whether certain combinations of words tend to appear together in reviews. I hope to discover whether game reviewers feel more strongly about one game than the other, which will shape my questions and the subsequent analysis.

Data Collection

I am using review data collected from Steam’s API to evaluate the sentiment of gamers’ reviews for DOTA 2 (game ID: 570) and Counter-Strike 2 (game ID: 730). This data will provide evidence to gauge the emotional sentiment of reviews. I chose to examine a subset of all reviews - namely, 100 reviews per game. I ensured that the number of reviews collected was the same for each game to avoid biasing my results.
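The collection script itself is not shown in this report. As a rough sketch only, request URLs for Steam's public store review endpoint could be built as below. The helper name `build_review_url` and the parameter choices (`filter=recent`, English-language reviews) are assumptions for illustration, not necessarily the settings used to gather this dataset.

```r
# Hypothetical helper: builds a request URL for Steam's store review endpoint.
# The query parameters are assumptions, not the settings used for this report.
build_review_url <- function(app_id, n = 100) {
  paste0(
    "https://store.steampowered.com/appreviews/", app_id,
    "?json=1",
    "&num_per_page=", n,       # reviews per request
    "&filter=recent",          # assumed ordering
    "&language=english"        # assumed language filter
  )
}

game_ids <- c(dota2 = 570, cs2 = 730)
urls <- vapply(game_ids, build_review_url, character(1))
print(urls)

# Fetching needs network access, so it is left commented out:
# reviews_570 <- jsonlite::fromJSON(urls[["dota2"]])$reviews
```

Requesting the same `n` for each game ID is what keeps the two samples the same size.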

Load Packages

library(tidyverse) # all the tidy things
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate) # for data wrangling with dates
library(tidytext)  # for tidy text mining
library(textdata)  # for analyzing lexicons of sentiment data
library(widyr)     # for calculating pairwise counts
library(igraph)    # for creating network analysis graphs

Attaching package: 'igraph'

The following objects are masked from 'package:lubridate':

    %--%, union

The following objects are masked from 'package:dplyr':

    as_data_frame, groups, union

The following objects are masked from 'package:purrr':

    compose, simplify

The following object is masked from 'package:tidyr':

    crossing

The following object is masked from 'package:tibble':

    as_data_frame

The following objects are masked from 'package:stats':

    decompose, spectrum

The following object is masked from 'package:base':

    union
library(ggraph)    # for relational data used in networks

Import Data

steam_reviews <- read.csv("https://myxavier-my.sharepoint.com/:u:/g/personal/moningk1_xavier_edu/IQDjpTysBoGcRrBMRbXPKuiVATT5wYKvQ14r0BUJoKrx4Rw?download=1")

I verified that there are 100 observations (reviews) per game. Using the same number of reviews for each game helps avoid bias and makes comparisons fairer.

# check review count per game
table(steam_reviews$game_id)

570 730 
100 100 

Data Wrangling

Here, I converted the numeric date/timestamp column into a date format so I can perform a chronological analysis.

# Convert numeric timestamp_created to datetimestamp
steam_reviews <-
  steam_reviews %>%
  mutate(datetimestamp = as_datetime(timestamp_created)) %>% 
  select(-timestamp_created)
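To illustrate what this conversion does, here is a minimal sketch on a single Unix timestamp. The value `1700000000` is made up for the example and does not come from the review data.

```r
library(lubridate)

# A Unix timestamp counts seconds since 1970-01-01 00:00:00 UTC
ts <- 1700000000        # illustrative value, not from the dataset

dt <- as_datetime(ts)   # POSIXct date-time in UTC
print(dt)               # 2023-11-14 22:13:20 UTC

# Date-based features can then be derived from the converted value
print(as_date(dt))      # calendar date only
print(wday(dt))         # numeric day of week, Sunday = 1 by default
```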

Using the new datetimestamp column, I arranged the dataset from least to most recent and then assigned each review an ID in that order. I removed the field “X”, which previously served as the review identification field.

steam_reviews <- 
  steam_reviews %>% 
  arrange(datetimestamp) %>% 
  mutate(review_id = row_number()) %>% 
  select(-X)

Here, I unnested tokens and removed stop words. Specifically, this code breaks up reviews into individual words (“tokens”) and then creates one word per row, storing the tokens in a new column. Importantly, I removed stop words - words that do not carry as much weight or meaning - from this new dataset to focus my analysis on words that are more likely to convey an emotional sentiment.

# unnest tokens by game_id and remove stop words
tidy_steam_reviews <- 
  steam_reviews %>%
  unnest_tokens(word, review) %>%
  anti_join(stop_words)
Joining with `by = join_by(word)`
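A small sketch of the same two steps on made-up reviews may help show what the tokenized data looks like. The reviews below are invented for illustration, and the join column is named explicitly so no join message is printed.

```r
library(dplyr)
library(tidytext)

# Two made-up reviews (illustrative only, not from the Steam data)
toy_reviews <- tibble(
  review_id = 1:2,
  review    = c("This game is fun", "The game has cheaters")
)

toy_tokens <- toy_reviews %>%
  unnest_tokens(word, review) %>%       # one lowercased word per row
  anti_join(stop_words, by = "word")    # drop rows whose word is a stop word

print(toy_tokens)
```

The other columns (here `review_id`) are carried along with each token, which is what later allows words to be counted per game or per review.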

Question:

What words are commonly used in reviews for each game?

Data

I used the review data that I collected from Steam’s API to identify words commonly used in reviews for Counter-Strike 2 and DOTA 2. I used the subset of reviews that I gathered - 100 reviews per game - to conduct my analysis.

Analysis

I retrieved simple counts of words associated with reviews for each game, arranging them from most to least frequently occurring.

# simple word counts for each game
tidy_steam_reviews %>% 
  group_by(game_id, word) %>% 
  summarize(n = n()) %>% 
  arrange(-n)
`summarise()` has grouped output by 'game_id'. You can override using the
`.groups` argument.
# A tibble: 837 × 3
# Groups:   game_id [2]
   game_id word         n
     <int> <chr>    <int>
 1     570 game        60
 2     730 game        42
 3     570 10          17
 4     570 play        16
 5     570 dota        15
 6     570 fun         12
 7     570 playing     10
 8     730 cheaters    10
 9     570 life         9
10     570 time         9
# ℹ 827 more rows
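As an aside, the grouping message printed above can be avoided with `count()`, which groups, tallies, and sorts in one step and returns ungrouped output. The sketch below uses made-up tokens rather than the review data.

```r
library(dplyr)

# Toy tokens (illustrative only)
toy_tokens <- tibble(
  game_id = c(570, 570, 570, 730, 730, 730),
  word    = c("game", "game", "fun", "game", "cheaters", "cheaters")
)

# Equivalent to group_by() + summarize(n = n()) + arrange(-n)
toy_counts <- toy_tokens %>%
  count(game_id, word, sort = TRUE)

print(toy_counts)
```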

Results

Unsurprisingly, the word “game” appears the most in reviews for both games. Interestingly, words like “fun”, “playing”, “life”, “community”, and “love” appear frequently in DOTA 2 reviews. These words typically have a positive connotation, suggesting positivity or enthusiasm from these reviewers; however, the word “toxic” also appears eight times in reviews for DOTA 2, contrasting with the more positive words. Fewer words recur across Counter-Strike 2 reviews, suggesting that reviewer attitudes may vary more for this game. One of the top words associated with this game is “cheaters”, indicating that reviewers may encounter foul play and dislike the game because of it. However, words like “fun” and “love” also appear in several reviews, indicating that some reviewers find the game enjoyable.

Question:

Do the two games share similar word combinations, or do players talk about each game differently in reviews?

Data

Again, I used the review data that I collected from Steam’s API to identify words commonly used in reviews for Counter-Strike 2 and DOTA 2. I used the subset of reviews that I gathered - 100 reviews per game - to conduct my analysis.

Analysis

I retrieved a pairwise count to find the frequency of combinations of words appearing together within any given review.

# pairwise_count() counts how many reviews each pair of words appears in together
review_word_pairs <-
  tidy_steam_reviews %>%
  group_by(game_id) %>%                # separate counts per game
  pairwise_count(
    item = word,                        # token column
    feature = review_id,                        # unique review ID
    upper = FALSE
  ) %>%
  arrange(-n)

# view
view(review_word_pairs)
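To make the pairwise count concrete, here is a minimal sketch on two made-up reviews. The toy data is an assumption for illustration; only the `pairwise_count()` call mirrors the analysis above.

```r
library(dplyr)
library(widyr)

# Toy tokens: review 1 = {game, fun}, review 2 = {game, fun, toxic}
toy_tokens <- tibble(
  review_id = c(1, 1, 2, 2, 2),
  word      = c("game", "fun", "game", "fun", "toxic")
)

toy_pairs <- toy_tokens %>%
  pairwise_count(
    item    = word,        # what is being paired
    feature = review_id,   # pairs are counted within each review
    upper   = FALSE,       # keep one row per pair, not both orderings
    sort    = TRUE
  )

print(toy_pairs)
```

The pair “game”/“fun” gets a count of 2 because both words occur in both reviews, while each pair involving “toxic” gets a count of 1.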

Results

Unsurprisingly, the word “game” frequently appears in reviews in combination with other words. Some interesting combinations found in DOTA 2 reviews include “life” and “community”, “friends” and “community”, “friends” and “toxic”, and “community” and “toxic”. This suggests that, while DOTA 2 may have a strong gaming community, some reviewers feel that the game’s fandom is toxic. In the Counter-Strike 2 reviews, I found it interesting that “game” and “cheaters”, as well as “game” and “cheat”, frequently appear together, reinforcing the earlier finding that reviewers may dislike this game due to the prevalence of cheating.

Question:

How do networks of co-occurring words differ between DOTA 2 and Counter-Strike 2?

Data

I used the same review data that I collected from Steam’s API.

Analysis

Here, I visualize the network of co-occurring words for DOTA 2 (game ID: 570). First, I set a seed so that the graph layout, which is otherwise positioned randomly, is reproducible; in other words, the seed ensures that the visualization looks the same each time the code is run. I then filtered to DOTA 2’s game ID to only examine word pairs from this game, dropped the game ID column since it isn’t needed for the graph, and then filtered to only include word pairs occurring at least three times to prevent the visual from being overly complicated.

# Visualizing the network of co-occurring words for DOTA 2 (game ID 570)
set.seed(1234)

review_word_pairs %>% 
  filter(game_id == 570) %>% 
  ungroup() %>%  
  select(-game_id) %>% 
  filter(n >= 3) %>%
  graph_from_data_frame() %>% 
  ggraph(layout = "fr") +     
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "#00263e") +
  geom_node_point(size = 3) +
  geom_node_text(aes(label = name), repel = TRUE,
                 point.padding = unit(0.2, "lines")) +
  theme_void()
Warning: The `trans` argument of `continuous_scale()` is deprecated as of ggplot2 3.5.0.
ℹ Please use the `transform` argument instead.

Here, I visualize the network of co-occurring words for Counter-Strike 2, repeating the same steps as above but filtering to Counter-Strike 2’s game ID (730) and keeping word pairs that occur at least twice rather than three times.

set.seed(1234)

review_word_pairs %>% 
  filter(game_id == 730) %>% 
  ungroup() %>%  
  select(-game_id) %>% 
  filter(n >= 2) %>%
  graph_from_data_frame() %>% 
  ggraph(layout = "fr") +     
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "#00263e") +
  geom_node_point(size = 3) +
  geom_node_text(aes(label = name), repel = TRUE,
                 point.padding = unit(0.2, "lines")) +
  theme_void()
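Since the two plotting chunks differ only in the game ID and the minimum pair count, they could be wrapped in a single function. The sketch below is one possible refactoring, not part of the original analysis: `plot_word_network()` is a hypothetical helper, and the toy pairs stand in for `review_word_pairs`.

```r
library(dplyr)
library(igraph)
library(ggplot2)
library(ggraph)

# Hypothetical helper: draws the co-occurrence network for one game
plot_word_network <- function(pairs, game, min_n) {
  set.seed(1234)  # fix the random starting positions of the "fr" layout
  pairs %>%
    ungroup() %>%
    filter(game_id == game, n >= min_n) %>%   # one game, frequent pairs only
    select(-game_id) %>%                      # first two columns become edges
    graph_from_data_frame() %>%
    ggraph(layout = "fr") +
    geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "#00263e") +
    geom_node_point(size = 3) +
    geom_node_text(aes(label = name), repel = TRUE) +
    theme_void()
}

# Toy word pairs (illustrative only)
toy_pairs <- tibble(
  game_id = c(570, 570, 730, 730),
  item1   = c("game", "play", "game", "game"),
  item2   = c("fun", "friends", "cheaters", "fun"),
  n       = c(5, 3, 4, 2)
)

p_dota <- plot_word_network(toy_pairs, game = 570, min_n = 3)
p_cs   <- plot_word_network(toy_pairs, game = 730, min_n = 2)
```

Passing the threshold as an argument also makes it explicit when the two games are filtered differently.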

Results

In both visualizations, “game” is a central word - unsurprisingly.

In the network for DOTA 2, there appears to be a strong connection between “play” and “friends”, indicating that there may be a social aspect to this game. More positive words, like “play”, “fun”, “friends”, “life”, and “love” appear to be clustered together and somewhat interconnected. Negative words, like “toxic”, “bad”, and “lose” also appear together, but the relationships are less strong than those connecting more positive words. Overall, DOTA 2’s network conveys that reviews typically revolve around social gameplay and community experiences, but notably there are some mentions of toxicity and negative game outcomes.

In the network for Counter-Strike 2, the strongest connection is between “game” and “cheaters”, reinforcing that cheating is a recurring theme in Counter-Strike 2 reviews. Other strong connections appear between “game” and “fun”, and “game” and “love”, implying more positive associations with the game. This network also contains a small cluster comprising words like “steamcommunity.com”, “id”, and “inventory”; this cluster is separate from the gameplay-related words, indicating that some reviewers discussed their profiles or inventories. Although many Counter-Strike 2 reviews revolve around cheating, many positive words also appear.

Question:

What types of emotions tend to appear in reviews for each game?

Data

I used the same review data collected from the API.

Analysis

Here, I visualize emotional sentiment counts for each game. I created a bar chart to display the number of words from each NRC emotion category that appear in the reviews of each game. To do this, I first had to load in the NRC lexicon.

# load in the NRC lexicon
nrc <- get_sentiments("nrc")
# Visualize emotional sentiment counts for each game
tidy_steam_reviews %>% 
  inner_join(nrc, by = "word", relationship = "many-to-many") %>% 
  group_by(sentiment, game_id) %>% 
  summarize(n = n()) %>% 
  mutate(game_id = factor(game_id, labels = c("Dota 2", "Counter-Strike 2"))) %>% # labels follow sorted IDs: 570, then 730
  ggplot(aes(x = sentiment, y = n, fill = game_id)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("#c08206","#00263e")) +
  labs(title = "Game Sentiment Scores",
       subtitle = "Total number of emotive words in reviews",
       y = "Total Number of Words",
       x = "Emotional Sentiment",
       fill = "Game") +
  theme_minimal()
`summarise()` has grouped output by 'sentiment'. You can override using the
`.groups` argument.
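The chart above compares raw word counts, which also depend on how much text each game's reviews contain. As a hedged sketch of one possible complement, the counts could be divided by each game's total number of tokens. The tokens and the small lexicon below are made up for illustration and are NOT real NRC entries.

```r
library(dplyr)

# Toy tokens and a hand-made lexicon (hypothetical entries, not the NRC data)
toy_tokens <- tibble(
  game_id = c(570, 570, 570, 570, 730, 730),
  word    = c("love", "toxic", "fun", "game", "fun", "game")
)
toy_lexicon <- tibble(
  word      = c("love", "love", "toxic", "fun"),
  sentiment = c("joy", "positive", "negative", "positive")
)

# Total tokens per game, used as the denominator
totals <- toy_tokens %>%
  count(game_id, name = "total_words")

toy_shares <- toy_tokens %>%
  # a word may belong to several categories, hence many-to-many
  inner_join(toy_lexicon, by = "word", relationship = "many-to-many") %>%
  count(game_id, sentiment) %>%
  left_join(totals, by = "game_id") %>%
  mutate(share = n / total_words)

print(toy_shares)
```

In this toy data, game 570 has two positive matches and game 730 has one, yet both have the same positive share (0.5) because game 570 has twice as many tokens.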

Results

Across the board, DOTA 2 has higher word counts in every emotional sentiment category. This suggests that DOTA 2 reviews carry more emotional connotations than Counter-Strike 2 reviews. Negative and positive sentiments are the highest for both games, but appear to be much more pronounced in DOTA 2 reviews. Interestingly, specific emotions like anger, anticipation, and trust are far more frequent in DOTA 2 reviews. Overall, the repeated occurrence of emotionally charged words in DOTA 2 reviews suggests that these reviews are more emotionally expressive, while the lower counts of emotional words in Counter-Strike 2 reviews indicate that those reviews are perhaps more straightforward or less emotionally intense. The strong presence of both negative and positive words for both games suggests that reviews are polarized, expressing either positive or negative opinions.

Question:

What positive and negative words do players most commonly use when talking about DOTA 2 and Counter-Strike 2?

Data

I used the same data collected from the API.

Analysis

Here, I visualize the most frequent positive and negative words for both games. First, I grouped words by game and word, and then counted how many times each word appears in each game. I joined with NRC to assign each word a sentiment. Lastly, I created a barplot of positive and negative words. I decided to exclude the word “pretty” since it can be used either as an adjective on its own or as an intensifier (e.g., “pretty bad”).

# groups words by game ID and word, and attaches NRC sentiment category
steam_reviews_counts <- 
  tidy_steam_reviews %>% 
  group_by(game_id, word) %>% 
  summarize(n = n()) %>% 
  inner_join(nrc) # join to nrc data
`summarise()` has grouped output by 'game_id'. You can override using the
`.groups` argument.
Joining with `by = join_by(word)`
Warning in inner_join(., nrc): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 17 of `x` matches multiple rows in `y`.
ℹ Row 987 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
# Barplot for positive and negative words
steam_reviews_counts %>% 
  filter(sentiment %in% c("positive", "negative")) %>% # keep only the two polarity categories
  group_by(game_id) %>% 
  filter(n > 2) %>%                                    # keep words appearing at least 3 times
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>% 
  filter(!word %in% c("pretty")) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~game_id, ncol = 2) +
  geom_text(aes(label = signif(n, digits = 3)), nudge_y = 8) +
  labs(title = "Positive and Negative Words for DOTA 2 and Counter-Strike 2",
       subtitle = "Only words appearing at least 3 times are shown") 

Results

For DOTA 2, “fun” is the most common positive word. Other positive words include “community”, “love”, and “recommend”. Frequent negative words include “bad”, “lose”, “trash”, and “toxic.” Since “toxic” is the most commonly occurring negative word, this suggests that many reviews may revolve around the toxicity of this game or its community.

For Counter-Strike 2, “fun” also appears as the most common positive word, along with “love.” Fewer sentiment words are repeated across this game’s reviews, suggesting less emotional intensity.

Question:

Does review sentiment vary by day of week?

Data

Again, I used the same data gathered from the API.

Analysis

Here, I loaded the bing sentiment lexicon, which classifies words as either positive or negative. I chose this lexicon for this visualization for simplicity. I extracted the day of the week from the datetimestamp column, and then calculated the number of positive and negative words per game for each day. I then computed a positivity score (positive words minus negative words) to gauge emotional sentiment. Finally, I plotted positivity scores for each game by day of the week.

bing <- 
  get_sentiments("bing")

# Positivity by day of week per game
tidy_steam_reviews %>%
  inner_join(bing, by = "word") %>% 
  mutate(day = wday(datetimestamp, label = TRUE)) %>%   # day of week from the converted datetimestamp
  group_by(game_id, day, sentiment) %>% 
  summarize(n = n()) %>% 
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative) %>% 
  ggplot(aes(x = day, y = sentiment, fill = factor(game_id))) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("#c08206","#00263e"),
                    labels = c("Dota 2", "Counter-Strike 2")) +
  labs(title = "Game Positivity Scores by Day of the Week",
       subtitle = "Positivity score = positive words minus negative words",
       y = "Total Positivity Score",
       x = "Day of the Week",
       fill = "Game")
`summarise()` has grouped output by 'game_id', 'day'. You can override using
the `.groups` argument.
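The reshaping step that produces the positivity score can be seen in isolation in the sketch below. It uses made-up, already-tagged tokens, and it uses `pivot_wider()`, the function tidyr now recommends in place of the superseded `spread()`; the result is the same wide layout with one column per sentiment.

```r
library(dplyr)
library(tidyr)

# Toy sentiment-tagged tokens (illustrative only)
toy_sentiments <- tibble(
  game_id   = c(570, 570, 570, 570, 730),
  day       = c("Sat", "Sat", "Sat", "Mon", "Mon"),
  sentiment = c("positive", "positive", "negative", "negative", "positive")
)

toy_scores <- toy_sentiments %>%
  count(game_id, day, sentiment) %>%             # words per game, day, polarity
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%               # missing combination = 0 words
  mutate(score = positive - negative)            # positivity score

print(toy_scores)
```

A game and day combination appears in the result only if at least one of its words matched the lexicon, so days with no matched words are absent rather than scored as zero.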

Results

Counter-Strike 2 reviews only appear on Mondays, indicating either that no reviews in the sample were published on the other days, or that reviews published on those days contained no words matching the bing lexicon and were therefore dropped by the join. This could reinforce the general lack of emotional sentiment found in Counter-Strike 2 reviews. Overall, Counter-Strike 2 reviews have a high positivity score on Monday.

DOTA 2 scores lower in terms of positivity on weekdays, whereas it scores higher on weekends (Saturday and Sunday).

Conclusion

Based on my analysis, DOTA 2 appears to display more emotional volatility in reviews, while Counter-Strike 2 reviews are generally less emotionally charged. This could suggest that DOTA 2 players feel more strongly about their game than Counter-Strike 2 players do about theirs.