Topic Modeling of Google Play Store Reviews

Topic modeling is a popular tool for understanding text data. Many methods have been developed for it, but the most frequently used is Latent Dirichlet Allocation (LDA), introduced by Blei et al. in 2003. LDA is a Bayesian mixture model for discrete data in which topics are assumed to be uncorrelated. It assumes three levels of data structure (word, topic, and document) and allows each document to be composed of multiple topics.
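
The generative idea behind LDA can be illustrated with a small simulation. The sketch below is for illustration only and is not part of the analysis that follows: the vocabulary, the number of topics, the Dirichlet parameter of 0.5, and the helper rdirichlet1() are all made-up assumptions.

```r
# Toy simulation of the LDA generative process (base R only).
set.seed(42)

# Hypothetical helper: one draw from a Dirichlet distribution,
# obtained by normalizing independent gamma draws
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

vocab   <- c("love", "fun", "level", "ads", "crash", "update")  # made-up vocabulary
k       <- 2    # number of topics
n_words <- 20   # words in the simulated document

# Topic level: each topic is a probability distribution over the vocabulary
topic_word <- t(replicate(k, rdirichlet1(rep(0.5, length(vocab)))))
colnames(topic_word) <- vocab

# Document level: each document has its own mixture of topics
doc_topic <- rdirichlet1(rep(0.5, k))

# Word level: pick a topic for each word position, then a word from that topic
z <- sample(seq_len(k), n_words, replace = TRUE, prob = doc_topic)
w <- vapply(
  z,
  function(topic) sample(vocab, 1, prob = topic_word[topic, ]),
  character(1)
)

# Cross-tabulate the sampled topics and words
table(topic = z, word = w)
```

Fitting LDA runs this process in reverse: given only the observed words, it estimates the topic-word probabilities and the per-document topic mixtures.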

In this short demo, we will use the Google Play Reviews Data, which was extracted from Google Play using Octoparse 8.5.8. The sample consists of 19,886 user reviews from 497 different apps across 12 app categories: Game, Education, Business, Tools, Entertainment, Music & Songs, Food, Shopping, Lifestyle, Productivity, Social, and Dating. On average, 40 user reviews were sampled from each app. The data collected includes the star rating, the date of the review, the review text, and the number of votes that found the review helpful. Only reviews in English were included. The extraction did not collect any information that may personally identify the user.

The reviews were divided into two groups. The “positive” reviews are those with star ratings of 4 or 5, while the “negative” reviews are those with star ratings of 3 or below. Text analysis was then performed separately for each group, as discussed in the next sections.
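
A minimal sketch of this grouping rule on made-up data is shown here. The toy data frame and the Sentiment column are assumptions for illustration; only the Num, Rating, and Comment column names and the rating cut-off come from the analysis itself.

```r
# Toy data standing in for the review table (values are made up)
toy_reviews <- data.frame(
  Num     = 1:5,
  Rating  = c(5, 3, 4, 1, 2),
  Comment = c("Great app", "It is okay", "Works well",
              "Crashes a lot", "Too many ads"),
  stringsAsFactors = FALSE
)

# Label each review: 4 or 5 stars is positive, 3 stars or below is negative
toy_reviews$Sentiment <- ifelse(toy_reviews$Rating >= 4, "positive", "negative")

# Split into one data frame per group
review_groups <- split(toy_reviews, toy_reviews$Sentiment)

nrow(review_groups$positive)
nrow(review_groups$negative)
```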

The following R code was used to fit a five-topic model using LDA.

# Get Data (change the path below to your own working directory)
setwd("C:\\Users\\Asus\\Documents\\Publications and Conferences\\TPS 2023")
gplay <- read.csv(".\\App_Reviews_Google_Play_Full.csv", stringsAsFactors = FALSE)



#+++++++++++++++++++++++++++++++++++++++++++++++++++++#
####                 TEXT PREPARATION              ####
#+++++++++++++++++++++++++++++++++++++++++++++++++++++#

library(tidyverse)
library(tidytext)

gplay_cat <- gplay[gplay$Rating >= 4,]    # For positive reviews
#gplay_cat <- gplay[gplay$Rating <= 3,]   # For negative reviews

# Create Custom Stop Words
# (only the word column is used for matching; lexicon is just a label)
custom_stop_words <- tribble(
  ~word, ~lexicon,
  "app", "custom",
  "1", "custom",
  "3", "custom",
  "5", "custom",
  "19", "custom",
  "2", "custom",
  "4", "custom",
  "10", "custom",
  "game", "custom"
)

# Append this to stop_words
stop_words2 <- stop_words %>%
  bind_rows(custom_stop_words)


# Get Tidy Text
review_tidy <- gplay_cat %>%
  unnest_tokens(word, Comment) %>%
  anti_join(stop_words2, by="word")


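# Perform Stemming in Review Text
library(tm)
library(SnowballC)

# Reduce each token to its stem, then strip digits and punctuation.
# Tokens made up only of digits or punctuation become empty strings here.
review_tidy$word <- stemDocument(review_tidy$word)
review_tidy$word <- removeNumbers(review_tidy$word)
review_tidy$word <- removePunctuation(review_tidy$word)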



#+++++++++++++++++++++++++++++++++++++++++++++++++#
####                    LDA                    ####
#+++++++++++++++++++++++++++++++++++++++++++++++++#
library(topicmodels)

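# Build a document-term matrix: one row per review (Num), one column per stemmed word
review_dtm <- review_tidy %>%
  count(word, Num) %>%
  cast_dtm(Num, word, n) %>%
  as.matrix()
dim(review_dtm)
## [1] 8896 9524
# Drop the first 8 term columns. This index is specific to this data set;
# inspect colnames(review_dtm)[1:8] before reusing it on other data.
review_dtm <- review_dtm[,-c(1:8)]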

# Fit the model with LDA() using Gibbs sampling; the seed makes the run reproducible
review_lda <- LDA(
  review_dtm,
  k = 5,   # Extract a 5-topic model
  method = "Gibbs",
  control = list(seed=42)
)

# beta = estimated probability of each word within each topic
review_topics <- tidy(review_lda, matrix = "beta")

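# View Words By Topic
library(ggplot2)

# Keep the 15 highest-probability terms in each topic
word_probs <- review_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 15) %>%
  ungroup() %>%
  mutate(term2 = reorder_within(term, beta, topic))   # order terms within each facet

ggplot(
  word_probs,
  aes(term2, beta, fill = as.factor(topic))
) +
  geom_col(show.legend = FALSE) +
  facet_wrap( ~ topic, scales = "free") +
  scale_x_reordered() +   # strip the suffix that reorder_within() adds to the labels
  coord_flip() + xlab("Term") + ylab("Beta")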

### Note: A typical topic-modeling analysis ends here. The next section is optional, in case you want additional results for your topic model study.


# Classify Reviews by Topic
# gamma = estimated proportion of each review (document) that belongs to each topic
review_gamma <- tidy(review_lda, matrix = "gamma")
review_gamma$document <- as.numeric(review_gamma$document)
review_gamma <- as.data.frame(review_gamma)

topic_1 <- review_gamma[review_gamma$topic == 1,]
topic_1 <- topic_1[order(topic_1$document),]
topic_2 <- review_gamma[review_gamma$topic == 2,]
topic_2 <- topic_2[order(topic_2$document),]
topic_3 <- review_gamma[review_gamma$topic == 3,]
topic_3 <- topic_3[order(topic_3$document),]
topic_4 <- review_gamma[review_gamma$topic == 4,]
topic_4 <- topic_4[order(topic_4$document),]
topic_5 <- review_gamma[review_gamma$topic == 5,]
topic_5 <- topic_5[order(topic_5$document),]


summary(review_gamma$gamma)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.09286 0.16667 0.18919 0.20000 0.22078 0.50000
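# Classify gamma values: flag a review under a topic (1) when its gamma
# exceeds the 75th percentile of all gamma values, otherwise 0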
gam1 <- as.numeric(topic_1$gamma > quantile(review_gamma$gamma, 0.75))
gam2 <- as.numeric(topic_2$gamma > quantile(review_gamma$gamma, 0.75))
gam3 <- as.numeric(topic_3$gamma > quantile(review_gamma$gamma, 0.75))
gam4 <- as.numeric(topic_4$gamma > quantile(review_gamma$gamma, 0.75))
gam5 <- as.numeric(topic_5$gamma > quantile(review_gamma$gamma, 0.75))


text_topic <- data.frame(Num = topic_1$document, topic1 = gam1,
                         topic2 = gam2, topic3 = gam3, topic4 = gam4, topic5=gam5)

gplay_match <- merge(gplay_cat, text_topic, by="Num", all.x = TRUE)


# Count the number of reviews that fall under each topic
topic_counts <- gplay_match %>%
  count(topic1)     # You may also count topic2, topic3, topic4, or topic5
topic_counts
##   topic1    n
## 1      0 6528
## 2      1 2368
## 3     NA    4
topic_counts <- gplay_match %>%
  group_by(Genre) %>%
  count(topic5)
print(topic_counts, n=26)
## # A tibble: 26 × 3
## # Groups:   Genre [12]
##    Genre         topic5     n
##    <chr>          <dbl> <int>
##  1 Business           0   299
##  2 Business           1    92
##  3 Dating             0   244
##  4 Dating             1   120
##  5 Education          0  1223
##  6 Education          1   194
##  7 Entertainment      0   124
##  8 Entertainment      1   146
##  9 Food               0   147
## 10 Food               1    39
## 11 Game               0  1746
## 12 Game               1   495
## 13 Lifestyle          0   174
## 14 Lifestyle          1    36
## 15 Lifestyle         NA     1
## 16 Music & Songs      0  1184
## 17 Music & Songs      1   277
## 18 Productivity       0   532
## 19 Productivity       1   109
## 20 Shopping           0   304
## 21 Shopping           1    61
## 22 Social             0   526
## 23 Social             1   558
## 24 Social            NA     3
## 25 Tools              0   141
## 26 Tools              1   125
# Find top documents per topic
k <- 1                # Use 2, 3, 4, or 5
dat <- review_gamma[review_gamma$topic == k,]
top_review <- dat[order(dat$gamma, decreasing = TRUE),]
head(top_review)
##      document topic     gamma
## 5163     2572     1 0.4358974
## 3962     3965     1 0.4318182
## 2437     2414     1 0.4303797
## 3985     1142     1 0.4285714
## 241      2403     1 0.4268293
## 2550      768     1 0.4146341
# Print the review text of the six top-ranked documents
docs <- top_review$document[c(1:6)]
for(i in seq_along(docs)) {
  print(gplay_cat[gplay_cat$Num == docs[i],]$Comment)
}
## [1] "Not bad.. But once you run out of energy you are eventually forced to spend money or wait to continue playing. Also forced To spend money when you need a more powerful weapon which costs a ridiculous amount compared to how much you are able to earn through-out the game. I do love how you are saving people though. Good game but unless u got money to spend on a mobile game, then you are pretty much stuck when you get to a certain point."
## [1] "Love this game. Very interactive and addictive. Awesome graphics. Be prepared to spend real money on it though. There are some long wait times for certain items especially the ones that are randomly dropped. Gems, saws, and hands drop frequently but deads and markers are rare. I have to spend real money every time I need more land. I think we should be able to purchase at least the rare items with coins earned from playing. I've played similar games and not spent a cent on them."
## [1] "This game is really fun. There's plenty of things to do and a ton of restaurants to unlock. I do have one problem though. It's hard to get gems to upgrade stuff. Sure, you could open chef mystery boxes, play the casino, and maybe participate in tournaments and challenges. But if you need gems fast and don't want to spend actual money you're going to spend a lot of time replaying levels to get xp and level up. But other than that it's a great game:)"
## [1] "I love this game, the rising difficulty as the game progresses is fun, and the mini games are enjoyable. I really love the addition of the season passes and the event specials, but one disappointing thing is that we don't get a chance to redo these mini games if we don't finish them within the alloted time frame. For example, I really wanted to finish Lars' makeover, but I got stuck on a hard level for 2 days and failed. I wish old events and seasons would return every so often."
## [1] "I love this game! I've been playing it since day one. Had completed 22 restaurants. All 40 levels, all tasks, kitchen and interior upgrades. I had over 1,000 gems and 3 millions coins and lost my progress. I never once bought gems or coins. I just played levels over and over until a new restaurant was released. I started over and now I've completed 27 restaurants the same as before. I had 6 million coins until the new upgrade to the casino. It's take too many coins and never win. Fix it back!"
## [1] "If you love chaos and sharks, then download. This game is very nice, with smooth graphics, fun art style, and unpredictable gameplay. They have many sharks to work for, making it worth it to work for one. It has 4 diverse maps, each having different obstacles, animals, and mechanics. No surprise ads, only for optional boosts, and the game has competitions often. They also look after shark awareness a lot, which is great for these endangered fish. All over it is a great and fun way to kill time."
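
The per-topic classification step above repeats the same lines for each of the five topics. A more compact variant is sketched below on a made-up gamma table, assuming the dplyr and tidyr packages are installed. The toy values and the names toy_gamma, cutoff, and toy_flags are hypothetical; only the 75th-percentile rule is taken from the code above.

```r
library(dplyr)
library(tidyr)

# Toy gamma table: 3 documents x 2 topics (made-up values, not from the review data)
toy_gamma <- tibble(
  document = rep(c(1, 2, 3), each = 2),
  topic    = rep(1:2, times = 3),
  gamma    = c(0.8, 0.2, 0.5, 0.5, 0.1, 0.9)
)

# Same rule as above: flag a topic when gamma exceeds
# the 75th percentile of all gamma values
cutoff <- quantile(toy_gamma$gamma, 0.75)

# One row per document, one 0/1 column per topic
toy_flags <- toy_gamma %>%
  mutate(flag = as.numeric(gamma > cutoff)) %>%
  select(document, topic, flag) %>%
  pivot_wider(names_from = topic, values_from = flag, names_prefix = "topic")

toy_flags
```

Applied to review_gamma, the same pattern would yield one row per review with columns topic1 to topic5, ready to be merged with the review data by document number.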

For the full report, see the attached file on Topic Identification and Classification of Google Play Store Reviews. Also see Chapter 6 of Text Mining with R, available at https://www.tidytextmining.com/