0. INTRODUCTION

For this independent study, I analyzed a new dataset of App Store user reviews for Mint, a popular financial management application. Mint helps users keep track of their income and expenses, and it also gives them insight into their credit score, bills, and balances. Despite having many competitors, Mint has risen to become one of the top financial management apps on the market.


1. PREPARE

1b. Guiding Questions

During this analysis, I wanted to answer three questions:

  1. Which topics are discussed most often in the reviews of a successful application?

  2. Which finance-related topics appear most often in these reviews?

  3. How much do the results of the topic modeling techniques (stemming, LDA vs. STM) differ from each other?

1c. Set Up

First, we load the packages needed for this walkthrough.

library(tidyverse)
library(tidytext)
library(SnowballC)
library(topicmodels)
library(stm)
library(ldatuning)
library(knitr)
library(LDAvis)

2. WRANGLE

2a. Import Data From Mint

I pulled 10,000 reviews of the Mint application from the App Store using a Python library named “app_store_scraper” and imported the data into this project.

mint_reviews <- read_csv("data/mint.csv", 
     col_types = cols(userName = col_character(),
                   title = col_character(), 
                   review = col_character()
                   ))
mint_reviews <- select(mint_reviews,userName,title,review)

2b. Cast a Document Term Matrix

Tidying Text

When I looked at the attributes in this dataset, I found that the “title” column contains useful information. Hence, I concatenated the “title” and “review” columns into a new column named “combined”. The combined text is then tokenized and stripped of stop words.

mint_reviews$combined = paste(mint_reviews$title, mint_reviews$review, sep=" ")

mint_selected <- mint_reviews %>%
  unnest_tokens(output = word, input = combined) %>%
  anti_join(stop_words, by = "word")

mint_selected <- select(mint_selected,userName, word)
mint_selected <- na.omit(mint_selected)

The most common words in the reviews of Mint are:

mint_selected %>%
  count(word, sort = TRUE)
## # A tibble: 9,691 x 2
##    word         n
##    <chr>    <int>
##  1 app      12661
##  2 mint      5067
##  3 accounts  3572
##  4 it’s      2678
##  5 love      2438
##  6 update    2318
##  7 budget    2313
##  8 account   2270
##  9 time      2061
## 10 credit    1704
## # ... with 9,681 more rows

Looking at the most frequent words, I removed those that add little to the analysis, such as “app” and “mint”.

mint_tidy <- mint_selected %>%
  select(userName, word) %>% 
  filter(!word %in% c("app", "mint", "it’s", "i’ve", "i’m", "can’t", "don’t",
                      "doesn’t", "won’t", "like", "see", "one", "just", "now",
                      "use", "account", "accounts", "finances", "financial"))
mint_count <- mint_tidy %>%
  count(word, sort = TRUE)

PS. I realized that R treats the typographic apostrophe (’) and the straight single quote (') as two different characters. It took me some time to figure this out: the reviews use the typographic apostrophe, so to filter a word like “i’m” you have to type it with that same character; otherwise, the filter will not match anything!

Terms like “budget,” “update,” and “love” are what we would expect to see from reviewers. However, the term “time” is less intuitive and worth a quick look. I used the sample_n() function to draw 10 random reviews that contain “time”:

## # A tibble: 10 x 1
##    combined                                                                     
##    <chr>                                                                        
##  1 "Used to be good This app and its support have gone the way of so many good ~
##  2 "Renewed Interface is not intuitive I think Mint actually updated the interf~
##  3 "Not as intuitive as phone or web app The layout and functionality of the iP~
##  4 "Good and bad I really like the layout and the quick view. The ability to co~
##  5 "Ok Devs, time to update!! Been using Mint for years now. Amazing app, good ~
##  6 "Good, but could be so much better I switched to Mint from Personal Cap a fe~
##  7 "stash i love mint and as advertised on mint, i use stash as well. since sta~
##  8 "Looking for a real update The app is good; it works well most of the time. ~
##  9 "Great idea, poor execution Everyone knows Mint’s basic purpose and it’s a g~
## 10 "Super helpful for tracking expenses I use it all the time"

Creating a Document Term Matrix

We will treat each individual review as a unique “document.” To do this, we use the “userName” attribute as the document identifier, on the assumption that each user wrote one review. We then create a document term matrix by counting the number of times each word appears in each user’s review.

mint_dtm <- mint_tidy %>%
  count(userName, word) %>%
  cast_dtm(userName, word, n)

class(mint_dtm)
## [1] "DocumentTermMatrix"    "simple_triplet_matrix"
mint_dtm
## <<DocumentTermMatrix (documents: 9994, terms: 9678)>>
## Non-/sparse entries: 158369/96563563
## Sparsity           : 100%
## Maximal term length: 23
## Weighting          : term frequency (tf)

Processing and Stemming for STM

Let’s go ahead and prepare our reviews for structural topic modeling:

temp <- textProcessor(mint_reviews$combined,
                    metadata = mint_reviews,  
                    lowercase=TRUE, 
                    removestopwords=TRUE, 
                    removenumbers=TRUE,  
                    removepunctuation=TRUE, 
                    wordLengths=c(3,Inf),
                    stem=TRUE,
                    onlycharacter= FALSE, 
                    striphtml=TRUE, 
                    customstopwords=c("app","mint","it’s", "i’ve","i’m","can’t","don’t", "doesn’t","won’t","like","see", "one","just" ,"now","use","account","accounts","finances","financial"))
## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Remove Custom Stopwords...
## Removing numbers... 
## Stemming... 
## Creating Output...
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents

Stemming Tidy Text

Let’s take a look at the original words and the stems that are produced:

stemmed_mint <- mint_reviews %>%
  unnest_tokens(output = word, input = combined) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% c("app", "mint", "it’s", "i’ve", "i’m", "can’t", "don’t",
                      "doesn’t", "won’t", "like", "see", "one", "just", "now",
                      "use", "account", "accounts", "finances", "financial")) %>%
  mutate(stem = wordStem(word))

stemmed_mint
## # A tibble: 179,771 x 5
##    userName  title                       review                     word   stem 
##    <chr>     <chr>                       <chr>                      <chr>  <chr>
##  1 Mere67193 Using for years and love it "Hi there - I have been u~ love   love 
##  2 Mere67193 Using for years and love it "Hi there - I have been u~ budge~ budg~
##  3 Mere67193 Using for years and love it "Hi there - I have been u~ keepi~ keep 
##  4 Mere67193 Using for years and love it "Hi there - I have been u~ track  track
##  5 Mere67193 Using for years and love it "Hi there - I have been u~ credit cred~
##  6 Mere67193 Using for years and love it "Hi there - I have been u~ score  score
##  7 Mere67193 Using for years and love it "Hi there - I have been u~ couple coupl
##  8 Mere67193 Using for years and love it "Hi there - I have been u~ score  score
##  9 Mere67193 Using for years and love it "Hi there - I have been u~ grown  grown
## 10 Mere67193 Using for years and love it "Hi there - I have been u~ 50     50   
## # ... with 179,761 more rows

We can see that words like “budgeting” that occur frequently in our reviews have been reduced to the word stem “budget”.

## <<DocumentTermMatrix (documents: 9994, terms: 6394)>>
## Non-/sparse entries: 152439/63749197
## Sparsity           : 100%
## Maximal term length: 23
## Weighting          : term frequency (tf)
## <<DocumentTermMatrix (documents: 9994, terms: 9678)>>
## Non-/sparse entries: 158369/96563563
## Sparsity           : 100%
## Maximal term length: 23
## Weighting          : term frequency (tf)
## # A tibble: 6,394 x 2
##    stem         n
##    <chr>    <int>
##  1 budget    4017
##  2 updat     3554
##  3 love      2701
##  4 time      2651
##  5 track     2178
##  6 transact  2068
##  7 spend     2047
##  8 bill      2033
##  9 month     1788
## 10 bank      1754
## # ... with 6,384 more rows

Since stemming shrinks the vocabulary by roughly 3,300 terms (6,394 stems versus 9,678 unstemmed words), it may be more reasonable to use the stemmed data rather than the unstemmed data.

3. MODEL

3a. Fitting a Topic Model with LDA

I ran the searchK() function from stm and found that 7 is a reasonable number of topics for these reviews, as it keeps semantic coherence high.

mint_lda <- LDA(mint_dtm, 
                  k = 7, 
                  control = list(seed = 588)
                  )

mint_lda
## A LDA_VEM topic model with 7 topics.

3b. Fitting a Structural Topic Model

To fit an STM, we need the documents, metadata, and vocabulary produced by textProcessor():

docs <- temp$documents 
meta <- temp$meta 
vocab <- temp$vocab 

Now we use these elements to fit the model, with the same number of topics K that we specified for our LDA topic model.

mint_stm <- stm(documents=docs, 
         data=meta,
         vocab=vocab, 
         K=7,
         max.em.its=25,
         verbose = FALSE)
mint_stm
## A topic model with 7 topics, 10000 documents and a 8025 word dictionary.

Let’s plot the top 10 terms in each topic:

plot.STM(mint_stm, n = 10)

3c. Finding K

The FindTopicsNumber Function

I use the FindTopicsNumber() function from ldatuning to estimate a suitable number of topics.

k_metrics <- FindTopicsNumber(
  mint_dtm,
  topics = seq(5, 50, by = 5),
  metrics = "Griffiths2004",
  method = "Gibbs",
  control = list(),
  mc.cores = NA,
  return_models = FALSE,
  verbose = FALSE,
  libpath = NULL
)

FindTopicsNumber_plot(k_metrics)

As we see, k = 35 seems to be the best choice according to this metric, but based on the results I obtained from searchK(), 7 is more likely to provide good results. That matches my own expectation, as I do not think app reviews cover a wide variety of topics.

The LDAvis Explorer

We can also use the toLDAvis() function to generate visualizations for exploring topic and word distributions in the LDAvis topic browser:

toLDAvis(mod = mint_stm, docs = docs)
## Loading required namespace: servr

4. EXPLORE

4a. Exploring Beta Values

Let’s take a look at the 10 most likely terms assigned to each topic.

terms(mint_lda, 10)
##       Topic 1        Topic 2        Topic 3      Topic 4        Topic 5   
##  [1,] "transactions" "update"       "love"       "budget"       "love"    
##  [2,] "bank"         "transactions" "time"       "spending"     "budget"  
##  [3,] "budget"       "fix"          "credit"     "transactions" "money"   
##  [4,] "month"        "track"        "bills"      "update"       "easy"    
##  [5,] "time"         "issues"       "update"     "money"        "bills"   
##  [6,] "track"        "time"         "bank"       "track"        "credit"  
##  [7,] "bills"        "budgets"      "budget"     "add"          "add"     
##  [8,] "easy"         "month"        "track"      "change"       "bill"    
##  [9,] "version"      "helpful"      "categories" "card"         "manually"
## [10,] "support"      "bank"         "card"       "information"  "feature" 
##       Topic 6      Topic 7      
##  [1,] "love"       "spending"   
##  [2,] "update"     "money"      
##  [3,] "time"       "credit"     
##  [4,] "track"      "user"       
##  [5,] "budget"     "support"    
##  [6,] "fix"        "months"     
##  [7,] "spending"   "information"
##  [8,] "categories" "track"      
##  [9,] "budgeting"  "month"      
## [10,] "version"    "link"

Now let’s look at this information visually:

tidy_lda <- tidy(mint_lda)
top_terms <- tidy_lda %>%
  group_by(topic) %>%
  slice_max(beta, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  group_by(topic, term) %>%    
  arrange(desc(beta)) %>%  
  ungroup() %>%
  ggplot(aes(beta, term, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  labs(title = "Top 10 terms in each LDA topic",
       x = expression(beta), y = NULL) +
  facet_wrap(~ topic, ncol = 4, scales = "free")

4b. Exploring Gamma Values

We can combine our beta and gamma values to understand the topic prevalence in our corpus, and which words contribute to each topic.

td_beta <- tidy(mint_lda)
td_gamma <- tidy(mint_lda, matrix = "gamma")
top_terms <- td_beta %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ", ")) %>% 
  unnest(cols = c(terms))
gamma_terms <- td_gamma %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(top_terms, by = "topic") %>%
  mutate(topic = paste0("Topic ", topic),
         topic = reorder(topic, gamma))

gamma_terms %>%
  select(topic, gamma, terms) %>%
  kable(digits = 3, 
        col.names = c("Topic", "Expected topic proportion", "Top 10 terms"))
|Topic   | Expected topic proportion|Top 10 terms                                                                          |
|:-------|-------------------------:|:-------------------------------------------------------------------------------------|
|Topic 2 |                     0.143|update, transactions, fix, track, issues, time, budgets, month, helpful, bank         |
|Topic 4 |                     0.143|budget, spending, transactions, update, money, track, add, change, card, information  |
|Topic 7 |                     0.143|spending, money, credit, user, support, months, information, track, month, link       |
|Topic 6 |                     0.143|love, update, time, track, budget, fix, spending, categories, budgeting, version      |
|Topic 3 |                     0.143|love, time, credit, bills, update, bank, budget, track, categories, card              |
|Topic 1 |                     0.143|transactions, bank, budget, month, time, track, bills, easy, version, support         |
|Topic 5 |                     0.143|love, budget, money, easy, bills, credit, add, bill, manually, feature                |

4c. Discussion

My first impression after comparing the LDA and STM results was that the LDA topics share more terms with each other, while the STM topics show more variety in their words. That said, I found the terms in the LDA topics easier to interpret, as they were semantically more coherent than the STM terms, at least for this dataset. My conclusion from this comparison is that if we want the terms within each topic to make more sense together, we should go for LDA, but if our goal is to find a hierarchy of importance among topics, we should choose STM. For example, topic 2 in the STM model contains essentially the most important words in the corpus, and topic 4 contains roughly the second most important group.

I would like to categorize the most important topics discussed in the reviews as follows:

  1. Tracking budget and expenses: terms related to this category appear frequently in the LDA model, for example in topics 2, 7, and 5. We can also see terms relevant to budget tracking in topic 2 of STM.

  2. Category of transactions: words related to this topic can be seen in topics 3 and 6 of LDA and topic 2 of STM. Mint has a feature that automatically categorizes each transaction you make. This has always seemed very useful to me, and I was curious to see whether anything associated with it would show up in the topics; interestingly, it did. This feature is especially helpful when users want to plan their expenses within a specific category such as shopping, entertainment, bills, or health.

  3. Application features, updates, and issues: this category was also expected to appear in the topics. Topics 2, 4, 1, and 5 of LDA have terms directly related to the features and issues of Mint. Even though the term “fix” appears in some of these topics, these models cannot tell us whether users are talking about issues that were fixed or problems that still exist. We can also see some words relevant to this subject in topics 4, 6, and 3 of STM.

The last thing I wanted to investigate is whether the topics show any hidden or apparent pattern that could correlate with the success of the application. By success, I mean that many websites and online magazines, such as Forbes, CNBC, and Investopedia, have named Mint the best or one of the best applications for financial management. Interestingly, almost no negative words appear in either the LDA or the STM model. For instance, I could not find a single negative word in topics 4, 7, 3, 1, and 5 of LDA. I think this is consistent with the high popularity of the application. To test this hypothesis, I would need to study the topics of applications with low ratings and see how different they are.

All in all, this was one of the most enjoyable projects I have done for the course so far, and I will most probably analyze the reviews of more applications in the future.

---
title: "Week 7 Independent Analysis: Topic Modeling"
output: 
  html_document:
    toc: true
    toc_depth: 3
    toc_float: yes
    code_folding: hide
    code_download: TRUE
editor_options: 
  markdown: 
    wrap: 72
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## 0. INTRODUCTION

For this independent study, I analyzed a new dataset of App Store user
reviews for Mint, a popular financial management application. Mint helps
users keep track of their income and expenses, and it also gives them
insight into their credit score, bills, and balances. Despite having
many competitors, Mint has risen to become one of the top financial
management apps on the market.

------------------------------------------------------------------------

## 1. PREPARE

### 1b. Guiding Questions

During this analysis, I wanted to answer three questions:

1.  Which topics are discussed most often in the reviews of a successful
    application?

2.  Which finance-related topics appear most often in these reviews?

3.  How much do the results of the topic modeling techniques
    (stemming, LDA vs. STM) differ from each other?

### 1c. Set Up

First, we load the packages needed for this walkthrough.

```{r load-packages, message=FALSE}
library(tidyverse)
library(tidytext)
library(SnowballC)
library(topicmodels)
library(stm)
library(ldatuning)
library(knitr)
library(LDAvis)
```

------------------------------------------------------------------------

## 2. WRANGLE

### 2a. Import Data From Mint

I pulled 10,000 reviews of the Mint application from the App Store using
a Python library named "app_store_scraper" and imported the data into
this project.

```{r read-csv}
mint_reviews <- read_csv("data/mint.csv", 
     col_types = cols(userName = col_character(),
                   title = col_character(), 
                   review = col_character()
                   ))
mint_reviews <- select(mint_reviews,userName,title,review)
```

### 2b. Cast a Document Term Matrix

#### Tidying Text

When I looked at the attributes in this dataset, I found that the
"title" column contains useful information. Hence, I concatenated the
"title" and "review" columns into a new column named "combined". The
combined text is then tokenized and stripped of stop words.

```{r tokenize-forums}
mint_reviews$combined = paste(mint_reviews$title, mint_reviews$review, sep=" ")

mint_selected <- mint_reviews %>%
  unnest_tokens(output = word, input = combined) %>%
  anti_join(stop_words, by = "word")

mint_selected <- select(mint_selected,userName, word)
mint_selected <- na.omit(mint_selected)
```

The most common words in the reviews of Mint are:

```{r count-words}
mint_selected %>%
  count(word, sort = TRUE)
```

Looking at the most frequent words, I removed those that add little to
the analysis, such as "app" and "mint".

```{r}
mint_tidy <- mint_selected %>%
  select(userName, word) %>% 
  filter(!word %in% c("app", "mint", "it’s", "i’ve", "i’m", "can’t", "don’t",
                      "doesn’t", "won’t", "like", "see", "one", "just", "now",
                      "use", "account", "accounts", "finances", "financial"))
```

```{r}
mint_count <- mint_tidy %>%
  count(word, sort = TRUE)
```

**PS.** I realized that R treats the typographic apostrophe (`’`) and
the straight single quote (`'`) as two different characters. It took me
some time to figure this out: the reviews use the typographic
apostrophe, so to filter a word like `i’m` you have to type it with that
same character; otherwise, the filter will not match anything!
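
The difference is easy to verify in base R. The following is a minimal
sketch, not part of the original analysis: the variable names and the
helper `normalise_apostrophes()` are hypothetical. It only shows that
the two characters have different code points and how one could map one
onto the other before filtering.

```r
# Typographic apostrophe (U+2019) versus straight single quote (U+0027)
curly    <- "i\u2019m"
straight <- "i'm"

curly == straight      # FALSE: the two strings differ
utf8ToInt("\u2019")    # 8217
utf8ToInt("'")         # 39

# Hypothetical helper: replace typographic apostrophes with straight ones
normalise_apostrophes <- function(x) gsub("\u2019", "'", x, fixed = TRUE)

normalise_apostrophes(curly) == straight   # TRUE
```

Normalising the text once, before tokenizing, would let every later
filter use plain keyboard quotes. It may also let the stop word list
catch these contractions on its own, assuming its entries are spelled
with the straight quote.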

Terms like "budget," "update," and "love" are what we would expect to
see from reviewers. However, the term "time" is less intuitive and worth
a quick look. I used the `sample_n()` function to draw 10 random reviews
that contain "time":

```{r find-quotes, echo=FALSE}
mint_quotes <- mint_reviews%>%
  select(combined) %>% 
  filter(grepl('time', combined))

sample_n(mint_quotes,10)
```
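
One caveat with the pattern used in this chunk: `grepl('time', ...)`
matches those letters anywhere in the text, so reviews containing words
such as "sometimes" or "lifetime" are selected as well. A hedged sketch
of a stricter whole-word match, using made-up example sentences rather
than real reviews:

```r
# Made-up example texts, for illustration only
texts <- c("I use it all the time",
           "Sometimes it crashes",
           "Lifetime user here")

# Substring match: all three texts contain the letters "time"
grepl("time", texts)

# Whole-word match: \\b marks a word boundary, so only the first text matches
grepl("\\btime\\b", texts, perl = TRUE)
```

Whether the stricter match is preferable depends on the question; for a
quick look at how reviewers use the word "time", it probably gives
cleaner samples.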

#### Creating a Document Term Matrix

We will treat each individual review as a unique "document." To do this,
we use the "userName" attribute as the document identifier, on the
assumption that each user wrote one review. We then create a document
term matrix by counting the number of times each word appears in each
user's review.

```{r cast-dtm}
mint_dtm <- mint_tidy %>%
  count(userName, word) %>%
  cast_dtm(userName, word, n)

class(mint_dtm)
mint_dtm
```
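
To make the counting step concrete, here is a small base-R sketch on a
made-up token table (the `toy` data and its user names are
hypothetical). It builds the same kind of document-by-term count table
with `xtabs()`; `cast_dtm()` holds equivalent counts in a sparse
format.

```r
# Made-up tokens: one row per (user, word) occurrence
toy <- data.frame(
  userName = c("u1", "u1", "u1", "u2", "u2"),
  word     = c("budget", "budget", "track", "budget", "love")
)

# Rows are documents (users), columns are terms, cells are counts
toy_counts <- xtabs(~ userName + word, data = toy)
toy_counts

# Hypothetical check: a user name that appears on more than one review
# would have those reviews merged into a single document
review_users <- c("u1", "u2", "u2")
anyDuplicated(review_users) > 0   # TRUE, because "u2" appears twice
```

Running the same kind of duplicate check on the real `userName` column
would confirm whether it is safe to use as a document identifier.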

#### Processing and Stemming for STM

Let's go ahead and prepare our reviews for structural topic modeling:

```{r textProcessor}
temp <- textProcessor(mint_reviews$combined,
                    metadata = mint_reviews,  
                    lowercase=TRUE, 
                    removestopwords=TRUE, 
                    removenumbers=TRUE,  
                    removepunctuation=TRUE, 
                    wordLengths=c(3,Inf),
                    stem=TRUE,
                    onlycharacter= FALSE, 
                    striphtml=TRUE, 
                    customstopwords=c("app","mint","it’s", "i’ve","i’m","can’t","don’t", "doesn’t","won’t","like","see", "one","just" ,"now","use","account","accounts","finances","financial"))
```

```{r stm-inputs}
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
```

#### Stemming Tidy Text

Let's take a look at the original words and the stems that are produced:

```{r wordStem}
stemmed_mint <- mint_reviews %>%
  unnest_tokens(output = word, input = combined) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% c("app", "mint", "it’s", "i’ve", "i’m", "can’t", "don’t",
                      "doesn’t", "won’t", "like", "see", "one", "just", "now",
                      "use", "account", "accounts", "finances", "financial")) %>%
  mutate(stem = wordStem(word))

stemmed_mint
```

We can see that words like "budgeting" that occur frequently in our
reviews have been reduced to the word stem "budget".
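
As a quick, self-contained check of how the stemmer treats related word
forms, the sketch below applies `wordStem()` from SnowballC to a few
hand-picked words. The word list is illustrative and not drawn from the
data.

```r
library(SnowballC)

# Hand-picked word forms, for illustration only
words <- c("budget", "budgets", "budgeting", "update", "updates", "tracking")

# Show each word next to the stem that wordStem() assigns to it
data.frame(word = words, stem = wordStem(words))
```

Some stems, such as "updat", are not dictionary words, which can make
stemmed topics slightly harder to read than unstemmed ones.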

```{r stem-counts, echo=FALSE, message=FALSE}
stemmed_dtm <- mint_reviews %>%
  unnest_tokens(output = word, input = combined) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% c("app", "mint", "it’s", "i’ve", "i’m", "can’t", "don’t",
                      "doesn’t", "won’t", "like", "see", "one", "just", "now",
                      "use", "account", "accounts", "finances", "financial")) %>%
  mutate(stem = wordStem(word)) %>%
  count(userName, stem, sort = TRUE) %>%
  cast_dtm(userName, stem, n)

stemmed_dtm
mint_dtm

stem_counts <- mint_reviews %>%
  unnest_tokens(output = word, input = combined) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% c("app", "mint", "it’s", "i’ve", "i’m", "can’t", "don’t",
                      "doesn’t", "won’t", "like", "see", "one", "just", "now",
                      "use", "account", "accounts", "finances", "financial")) %>%
  mutate(stem = wordStem(word)) %>%
  count(stem, sort = TRUE)

stem_counts
```

Since stemming shrinks the vocabulary by roughly 3,300 terms (6,394
stems versus 9,678 unstemmed words), it may be more reasonable to use
the stemmed data rather than the unstemmed data.
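
The size of the reduction can be worked out from the two matrix
summaries that the chunk above prints when the document is rendered
(9,678 terms before stemming, 6,394 after). A small arithmetic sketch;
the variable names are illustrative:

```r
terms_unstemmed <- 9678   # terms in mint_dtm, as printed in the rendered output
terms_stemmed   <- 6394   # terms in stemmed_dtm, as printed in the rendered output

terms_removed <- terms_unstemmed - terms_stemmed    # 3284 terms
share_removed <- terms_removed / terms_unstemmed    # about one third

terms_removed
round(100 * share_removed, 1)                       # 33.9 (percent)
```

In other words, stemming merges about a third of the vocabulary into
existing stems while the number of documents stays the same.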

## 3. MODEL

### 3a. Fitting a Topic Model with LDA

I ran the `searchK()` function from stm and found that 7 is a reasonable
number of topics for these reviews, as it keeps semantic coherence high.

```{r LDA}
mint_lda <- LDA(mint_dtm, 
                  k = 7, 
                  control = list(seed = 588)
                  )

mint_lda
```

### 3b. Fitting a Structural Topic Model

To fit an STM, we need the documents, metadata, and vocabulary produced
by `textProcessor()`:

```{r stm-docs}
docs <- temp$documents 
meta <- temp$meta 
vocab <- temp$vocab 
```

Now we use these elements to fit the model, with the same number of
topics *K* that we specified for our LDA topic model.

```{r stm}
mint_stm <- stm(documents=docs, 
         data=meta,
         vocab=vocab, 
         K=7,
         max.em.its=25,
         verbose = FALSE)
mint_stm
```

Let's plot the top 10 terms in each topic:

```{r plot-stm}
plot.STM(mint_stm, n = 10)
```

### 3c. Finding *K*

#### The FindTopicsNumber Function

I use the `FindTopicsNumber()` function from ldatuning to estimate a
suitable number of topics.

```{r find-topic, eval=FALSE}
k_metrics <- FindTopicsNumber(
  mint_dtm,
  topics = seq(5, 50, by = 5),
  metrics = "Griffiths2004",
  method = "Gibbs",
  control = list(),
  mc.cores = NA,
  return_models = FALSE,
  verbose = FALSE,
  libpath = NULL
)

FindTopicsNumber_plot(k_metrics)
```

As we see, k = 35 seems to be the best choice according to this metric,
but based on the results I obtained from `searchK()`, 7 is more likely
to provide good results. That matches my own expectation, as I do not
think app reviews cover a wide variety of topics.
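
The best value can also be read off programmatically instead of from the
plot. The sketch below assumes the result is a data frame with a
`topics` column and one column per metric, and that a higher
"Griffiths2004" score is better. The scores in `k_metrics_toy` are
invented purely to show the mechanics, so the value it selects says
nothing about the Mint data.

```r
# Invented scores, NOT results from the Mint reviews
k_metrics_toy <- data.frame(
  topics        = seq(5, 25, by = 5),
  Griffiths2004 = c(-900, -870, -850, -860, -880)
)

# Pick the number of topics with the highest score
best_row <- which.max(k_metrics_toy$Griffiths2004)
best_k   <- k_metrics_toy$topics[best_row]
best_k
```

A single metric is only one piece of evidence, which is why the choice
of 7 topics here also leans on semantic coherence and on what is
plausible for app reviews.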

#### The LDAvis Explorer

We can also use the `toLDAvis()` function to generate visualizations for
exploring topic and word distributions in the `LDAvis` topic browser:

```{r LDAvis}
toLDAvis(mod = mint_stm, docs = docs)
```

## 4. EXPLORE

### 4a. Exploring Beta Values

Let's take a look at the 10 most likely terms assigned to each topic.

```{r terms}
terms(mint_lda, 10)
```

Now let's look at this information visually:

```{r top_terms}
tidy_lda <- tidy(mint_lda)
top_terms <- tidy_lda %>%
  group_by(topic) %>%
  slice_max(beta, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  group_by(topic, term) %>%    
  arrange(desc(beta)) %>%  
  ungroup() %>%
  ggplot(aes(beta, term, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  labs(title = "Top 10 terms in each LDA topic",
       x = expression(beta), y = NULL) +
  facet_wrap(~ topic, ncol = 4, scales = "free")
```

### 4b. Exploring Gamma Values

We can combine our beta and gamma values to understand the topic
prevalence in our corpus, and which words contribute to each topic.

```{r prevalence_table}
td_beta <- tidy(mint_lda)
td_gamma <- tidy(mint_lda, matrix = "gamma")
top_terms <- td_beta %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ", ")) %>% 
  unnest(cols = c(terms))

gamma_terms <- td_gamma %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(top_terms, by = "topic") %>%
  mutate(topic = paste0("Topic ", topic),
         topic = reorder(topic, gamma))

gamma_terms %>%
  select(topic, gamma, terms) %>%
  kable(digits = 3, 
        col.names = c("Topic", "Expected topic proportion", "Top 10 terms"))
```
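
A note on reading this table: gamma holds, for every document, the
estimated share of each topic, and each document's shares sum to 1.
Averaging gamma per topic therefore gives proportions that sum to 1
across topics. The sketch below illustrates this with a made-up
3-document, 2-topic matrix (`toy_gamma` is hypothetical, not model
output):

```r
# Made-up gamma values: one row per document, one column per topic
toy_gamma <- matrix(c(0.8, 0.2,
                      0.3, 0.7,
                      0.4, 0.6),
                    ncol = 2, byrow = TRUE,
                    dimnames = list(paste0("doc", 1:3), paste0("topic", 1:2)))

rowSums(toy_gamma)    # each document's topic shares sum to 1
colMeans(toy_gamma)   # expected topic proportions, here 0.5 and 0.5
round(1 / 7, 3)       # 0.143: the value when 7 topics are equally prevalent
```

In the rendered table every topic shows 0.143, which is 1/7, so for this
model the expected proportions do not separate more prevalent topics
from less prevalent ones.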

### 4c. Discussion

My first impression after comparing the LDA and STM results was that the
LDA topics share more terms with each other, while the STM topics show
more variety in their words. That said, I found the terms in the LDA
topics easier to interpret, as they were semantically more coherent than
the STM terms, at least for this dataset. My conclusion from this
comparison is that if we want the terms within each topic to make more
sense together, we should go for LDA, but if our goal is to find a
hierarchy of importance among topics, we should choose STM. For example,
topic 2 in the STM model contains essentially the most important words
in the corpus, and topic 4 contains roughly the second most important
group.

I would like to categorize the most important topics discussed in the
reviews as follows:

1.  **Tracking budget and expenses:** terms related to this category
    appear frequently in the LDA model, for example in topics 2, 7, and
    5. We can also see terms relevant to budget tracking in topic 2 of
    STM.

2.  **Category of transactions:** words related to this topic can be
    seen in topics 3 and 6 of LDA and topic 2 of STM. Mint has a feature
    that automatically categorizes each transaction you make. This has
    always seemed very useful to me, and I was curious to see whether
    anything associated with it would show up in the topics;
    interestingly, it did. This feature is especially helpful when users
    want to plan their expenses within a specific category such as
    shopping, entertainment, bills, or health.

3.  **Application features, updates, and issues:** this category was
    also expected to appear in the topics. Topics 2, 4, 1, and 5 of LDA
    have terms directly related to the features and issues of Mint. Even
    though the term "fix" appears in some of these topics, these models
    cannot tell us whether users are talking about issues that were
    fixed or problems that still exist. We can also see some words
    relevant to this subject in topics 4, 6, and 3 of STM.

The last thing I wanted to investigate is whether the topics show any
hidden or apparent pattern that could correlate with the success of the
application. By success, I mean that many websites and online magazines,
such as Forbes, CNBC, and Investopedia, have named Mint the best or one
of the best applications for financial management. Interestingly, almost
no negative words appear in either the LDA or the STM model. For
instance, I could not find a single negative word in topics 4, 7, 3, 1,
and 5 of LDA. I think this is consistent with the high popularity of the
application. To test this hypothesis, I would need to study the topics
of applications with low ratings and see how different they are.

All in all, this was one of the most enjoyable projects I have done for
the course so far, and I will most probably analyze the reviews of more
applications in the future.
