For this independent study, I analyzed a new dataset of App Store user reviews of “Mint,” a popular financial management application. Mint helps users keep track of their income and expenses and gives them insight into their credit score, bills, and balances. Despite having many competitors, it has become one of the top financial management apps on the market.
In this analysis, I wanted to answer three questions:
What topics are mostly discussed in the reviews of a successful application?
What finance-related topics will appear more in these reviews?
How do the results differ across text-processing and topic modeling choices (stemmed vs. unstemmed text, LDA vs. STM)?
First, we load the packages needed for this walkthrough.
library(tidyverse)
library(tidytext)
library(SnowballC)
library(topicmodels)
library(stm)
library(ldatuning)
library(knitr)
library(LDAvis)
I pulled 10,000 reviews of the Mint application from the App Store using a Python library named “app_store_scraper” and imported the data into this project.
mint_reviews <- read_csv("data/mint.csv",
col_types = cols(userName = col_character(),
title = col_character(),
review = col_character()
))
mint_reviews <- select(mint_reviews,userName,title,review)
When I looked at the attributes in this dataset, I found that the “title” column contains useful information. Hence, I concatenated the contents of the “title” and “review” columns into a new column named “combined”. The data is then tokenized and cleaned of stop words.
mint_reviews$combined = paste(mint_reviews$title, mint_reviews$review, sep=" ")
mint_selected <- mint_reviews %>%
unnest_tokens(output = word, input = combined) %>%
anti_join(stop_words, by = "word")
mint_selected <- select(mint_selected,userName, word)
mint_selected <- na.omit(mint_selected)
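One caveat worth checking here: paste() converts a missing value into the literal string "NA", so a review without a title would contribute a spurious token "na" after tokenization. The source does not say whether the Mint data contains missing titles, so the following is only a minimal sketch on a made-up data frame (`toy_reviews` and `blank_if_na` are hypothetical names, not part of the original analysis) showing one way to guard against it in base R.

```r
# Toy data standing in for mint_reviews (hypothetical values)
toy_reviews <- data.frame(
  title  = c("Love it", NA),
  review = c("Tracks my budget well", "Great app"),
  stringsAsFactors = FALSE
)

# paste() turns a missing title into the string "NA"
naive <- paste(toy_reviews$title, toy_reviews$review, sep = " ")

# Replace missing pieces with "" before pasting, then trim stray spaces
blank_if_na <- function(x) ifelse(is.na(x), "", x)
combined <- trimws(paste(blank_if_na(toy_reviews$title),
                         blank_if_na(toy_reviews$review),
                         sep = " "))

print(naive)
print(combined)
```

If missing titles do occur, the second version keeps the review text without adding an artificial "na" token.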
The most common words in the reviews of Mint are:
mint_selected %>%
count(word, sort = TRUE)
## # A tibble: 9,691 x 2
## word n
## <chr> <int>
## 1 app 12661
## 2 mint 5067
## 3 accounts 3572
## 4 it’s 2678
## 5 love 2438
## 6 update 2318
## 7 budget 2313
## 8 account 2270
## 9 time 2061
## 10 credit 1704
## # ... with 9,681 more rows
Looking at the most frequent words, I removed those that add little to the analysis, such as “app” and “mint”.
mint_tidy <- mint_selected %>%
select(userName, word) %>%
filter(!word %in% c("app", "mint", "it’s", "i’ve", "i’m", "can’t", "don’t", "doesn’t", "won’t", "like", "see", "one", "just", "now", "use", "account", "accounts", "finances", "financial"))
mint_count <- mint_tidy %>%
count(word, sort = TRUE)
PS. I realized that R treats the typographic apostrophe ’ and the straight single quote ' as different characters. It took me some time to find this out, but if you want to filter a word like “I’m”, you have to type the same typographic apostrophe that appears in the data; otherwise, the filter will not match!
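The apostrophe issue can be reproduced in a few lines. The sketch below compares the two characters directly and shows one possible workaround, which is to normalize the text before filtering. The helper name `normalise_apostrophes` is my own and is not part of the original analysis. This mismatch is also a likely reason why “it’s” survived the anti_join() with stop_words, since that list spells contractions with straight quotes.

```r
curly    <- "i\u2019m"   # typographic apostrophe (U+2019), as found in the review text
straight <- "i'm"        # straight single quote (U+0027), as usually typed

# The two strings look alike but are not equal
print(curly == straight)

# Replace typographic apostrophes with straight ones so either spelling matches
normalise_apostrophes <- function(x) gsub("\u2019", "'", x, fixed = TRUE)
print(normalise_apostrophes(curly) == straight)
```

Applying such a helper to the text column before unnest_tokens() would let the standard stop word list catch these contractions without listing them by hand.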
Terms like “budget,” “update,” and “love” are what we would expect to see from reviewers. The term “time,” however, is less intuitive and worth a quick look. I selected 10 random reviews containing the term “time” using the sample_n() function.
## # A tibble: 10 x 1
## combined
## <chr>
## 1 "Used to be good This app and its support have gone the way of so many good ~
## 2 "Renewed Interface is not intuitive I think Mint actually updated the interf~
## 3 "Not as intuitive as phone or web app The layout and functionality of the iP~
## 4 "Good and bad I really like the layout and the quick view. The ability to co~
## 5 "Ok Devs, time to update!! Been using Mint for years now. Amazing app, good ~
## 6 "Good, but could be so much better I switched to Mint from Personal Cap a fe~
## 7 "stash i love mint and as advertised on mint, i use stash as well. since sta~
## 8 "Looking for a real update The app is good; it works well most of the time. ~
## 9 "Great idea, poor execution Everyone knows Mint’s basic purpose and it’s a g~
## 10 "Super helpful for tracking expenses I use it all the time"
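The code that produced this sample is not shown above. A minimal sketch of how it could be done follows, using a made-up tibble (`toy_reviews` is hypothetical) so that it runs on its own; on the real data one would start from mint_reviews and ask for 10 rows instead of 2.

```r
library(dplyr)
library(stringr)

# Toy stand-in for mint_reviews (hypothetical rows)
toy_reviews <- tibble(
  combined = c("Saves me time every month",
               "Great budgeting tool",
               "Crashes all the time after the update",
               "Timely alerts for bills")
)

set.seed(588)  # make the random sample reproducible
time_sample <- toy_reviews %>%
  # \\b marks word boundaries, so "Timely" is not matched
  filter(str_detect(combined, regex("\\btime\\b", ignore_case = TRUE))) %>%
  sample_n(2) %>%
  select(combined)

time_sample
```

Matching on word boundaries is a design choice: a plain substring search for "time" would also return reviews that only contain words such as "sometimes" or "timely".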
We treat each review as a separate “document.” To do this, we use the “userName” attribute as the document identifier, on the assumption that each user wrote one review. We then create a document-term matrix by counting how many times each word appears in each user’s review.
mint_dtm <- mint_tidy %>%
count(userName, word) %>%
cast_dtm(userName, word, n)
class(mint_dtm)
## [1] "DocumentTermMatrix" "simple_triplet_matrix"
mint_dtm
## <<DocumentTermMatrix (documents: 9994, terms: 9678)>>
## Non-/sparse entries: 158369/96563563
## Sparsity : 100%
## Maximal term length: 23
## Weighting : term frequency (tf)
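The “Sparsity : 100%” line is a rounded figure. As a quick check, the numbers printed in the summary above can be plugged into a few lines of arithmetic to see how empty the matrix actually is.

```r
n_docs   <- 9994
n_terms  <- 9678
non_zero <- 158369     # non-sparse entries from the printed summary
zero     <- 96563563   # sparse (empty) entries from the printed summary

total_cells <- n_docs * n_terms
sparsity    <- zero / total_cells

# The two counts together cover every cell of the matrix
print(total_cells == non_zero + zero)

# Share of empty cells, in percent
print(round(sparsity * 100, 2))
```

So roughly 99.84% of the cells are zero, which is typical for short documents such as app reviews: each review uses only a handful of the 9,678 terms.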
Let’s go ahead and prepare our reviews for structural topic modeling:
temp <- textProcessor(mint_reviews$combined,
metadata = mint_reviews,
lowercase=TRUE,
removestopwords=TRUE,
removenumbers=TRUE,
removepunctuation=TRUE,
wordLengths=c(3,Inf),
stem=TRUE,
onlycharacter= FALSE,
striphtml=TRUE,
customstopwords=c("app","mint","it’s", "i’ve","i’m","can’t","don’t", "doesn’t","won’t","like","see", "one","just" ,"now","use","account","accounts","finances","financial"))
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Remove Custom Stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
Let’s take a look at the original words and the stems that are produced:
stemmed_mint <- mint_reviews %>%
unnest_tokens(output = word, input = combined) %>%
anti_join(stop_words, by = "word") %>%
filter(!word %in% c("app", "mint", "it’s", "i’ve", "i’m", "can’t", "don’t", "doesn’t", "won’t", "like", "see", "one", "just", "now", "use", "account", "accounts", "finances", "financial")) %>%
mutate(stem = wordStem(word))
stemmed_mint
## # A tibble: 179,771 x 5
## userName title review word stem
## <chr> <chr> <chr> <chr> <chr>
## 1 Mere67193 Using for years and love it "Hi there - I have been u~ love love
## 2 Mere67193 Using for years and love it "Hi there - I have been u~ budge~ budg~
## 3 Mere67193 Using for years and love it "Hi there - I have been u~ keepi~ keep
## 4 Mere67193 Using for years and love it "Hi there - I have been u~ track track
## 5 Mere67193 Using for years and love it "Hi there - I have been u~ credit cred~
## 6 Mere67193 Using for years and love it "Hi there - I have been u~ score score
## 7 Mere67193 Using for years and love it "Hi there - I have been u~ couple coupl
## 8 Mere67193 Using for years and love it "Hi there - I have been u~ score score
## 9 Mere67193 Using for years and love it "Hi there - I have been u~ grown grown
## 10 Mere67193 Using for years and love it "Hi there - I have been u~ 50 50
## # ... with 179,761 more rows
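We can see that words like “budgeting,” which occur frequently in our reviews, have been reduced to the word stem “budget”. The output below shows the document-term matrix built from the stems, then the unstemmed one again for comparison, and finally the most frequent stems.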
## <<DocumentTermMatrix (documents: 9994, terms: 6394)>>
## Non-/sparse entries: 152439/63749197
## Sparsity : 100%
## Maximal term length: 23
## Weighting : term frequency (tf)
## <<DocumentTermMatrix (documents: 9994, terms: 9678)>>
## Non-/sparse entries: 158369/96563563
## Sparsity : 100%
## Maximal term length: 23
## Weighting : term frequency (tf)
## # A tibble: 6,394 x 2
## stem n
## <chr> <int>
## 1 budget 4017
## 2 updat 3554
## 3 love 2701
## 4 time 2651
## 5 track 2178
## 6 transact 2068
## 7 spend 2047
## 8 bill 2033
## 9 month 1788
## 10 bank 1754
## # ... with 6,384 more rows
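Because the stemmed version has roughly 3,300 fewer terms (6,394 vs. 9,678), it may be more reasonable to use it rather than the unstemmed data. Note, however, that the LDA model below is fit on the unstemmed mint_dtm, whereas textProcessor() stems the text used for STM.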
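To illustrate why stemming shrinks the vocabulary, here is a small sketch on a hand-picked word list (the list is mine, chosen to resemble terms seen in the reviews). wordStem() from SnowballC collapses inflected forms onto a shared stem, so several distinct words end up as a single term.

```r
library(SnowballC)

words <- c("budget", "budgets", "budgeting",
           "update", "updated", "tracking", "track")
stems <- wordStem(words)   # uses the Porter stemmer by default

print(data.frame(word = words, stem = stems))

print(length(unique(words)))   # vocabulary size before stemming
print(length(unique(stems)))   # vocabulary size after stemming
```

The trade-off is readability: stems such as “updat” are less pleasant to read in topic summaries than the original words.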
I ran stm’s searchK() function and found that 7 is a reasonable number of topics for these reviews while keeping semantic coherence high.
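The search over K itself is not shown in this walkthrough. The sketch below shows roughly what such a search looks like with searchK() from stm. It runs on gadarian, a small example dataset bundled with stm, rather than on the Mint reviews, and the candidate values of K and the seed are arbitrary choices of mine.

```r
library(stm)

# gadarian is a small example dataset shipped with stm (not the Mint reviews)
processed <- textProcessor(gadarian$open.ended.response,
                           metadata = gadarian,
                           verbose = FALSE)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta,
                     verbose = FALSE)

# Fit one model per candidate K and collect diagnostics for each
k_search <- searchK(out$documents, out$vocab,
                    K = c(3, 5, 7),
                    data = out$meta,
                    heldout.seed = 588,
                    verbose = FALSE)

# Diagnostic plots (semantic coherence, residuals, held-out likelihood) by K
plot(k_search)
k_search$results
```

A common way to read the result is to look for a K where semantic coherence is still high and the held-out likelihood has stopped improving much; this remains a judgment call rather than a rule.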
mint_lda <- LDA(mint_dtm,
k = 7,
control = list(seed = 588)
)
mint_lda
## A LDA_VEM topic model with 7 topics.
To fit an STM, we need the documents, metadata, and vocabulary from the textProcessor() output:
docs <- temp$documents
meta <- temp$meta
vocab <- temp$vocab
We now use these elements to fit the model, with the same number of topics K that we specified for our LDA topic model.
mint_stm <- stm(documents=docs,
data=meta,
vocab=vocab,
K=7,
max.em.its=25,
verbose = FALSE)
mint_stm
## A topic model with 7 topics, 10000 documents and a 8025 word dictionary.
Let’s show the top 10 terms in each topic:
plot.STM(mint_stm, n = 10)
I also used the FindTopicsNumber() function from the ldatuning package to estimate a suitable number of topics.
k_metrics <- FindTopicsNumber(
mint_dtm,
topics = seq(5, 50, by = 5),
metrics = "Griffiths2004",
method = "Gibbs",
control = list(),
mc.cores = NA,
return_models = FALSE,
verbose = FALSE,
libpath = NULL
)
FindTopicsNumber_plot(k_metrics)
As the plot shows, k = 35 appears to be the best choice by this metric, but based on the results I obtained from searchK(), 7 is more likely to give good results. That matches my own expectation, as I do not think the reviews cover a wide variety of topics.
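The “best” k can also be read from the metrics table instead of the plot. FindTopicsNumber() returns a data frame with a topics column and one column per metric, and Griffiths2004 is a metric to maximize. The sketch below uses invented numbers (`toy_metrics` is hypothetical and is not the Mint result) to show the lookup.

```r
# Hypothetical metric values, for illustration only (not the Mint results)
toy_metrics <- data.frame(
  topics        = c(5, 10, 15, 20),
  Griffiths2004 = c(-1250, -1180, -1150, -1165)
)

# Griffiths2004 is a log-likelihood-based score, so larger is better
best_k <- toy_metrics$topics[which.max(toy_metrics$Griffiths2004)]
print(best_k)
```

On the real output the same line would be applied to k_metrics. A single metric rarely settles the question, which is why it is compared here against the result from stm.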
We can also use the toLDAvis() function to generate an interactive visualization of topic and word distributions in the LDAvis topic browser:
toLDAvis(mod = mint_stm, docs = docs)
## Loading required namespace: servr
Let’s take a look at the 10 most likely terms assigned to each topic.
terms(mint_lda, 10)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
## [1,] "transactions" "update" "love" "budget" "love"
## [2,] "bank" "transactions" "time" "spending" "budget"
## [3,] "budget" "fix" "credit" "transactions" "money"
## [4,] "month" "track" "bills" "update" "easy"
## [5,] "time" "issues" "update" "money" "bills"
## [6,] "track" "time" "bank" "track" "credit"
## [7,] "bills" "budgets" "budget" "add" "add"
## [8,] "easy" "month" "track" "change" "bill"
## [9,] "version" "helpful" "categories" "card" "manually"
## [10,] "support" "bank" "card" "information" "feature"
## Topic 6 Topic 7
## [1,] "love" "spending"
## [2,] "update" "money"
## [3,] "time" "credit"
## [4,] "track" "user"
## [5,] "budget" "support"
## [6,] "fix" "months"
## [7,] "spending" "information"
## [8,] "categories" "track"
## [9,] "budgeting" "month"
## [10,] "version" "link"
Now let’s look at this information visually:
tidy_lda <- tidy(mint_lda)
top_terms <- tidy_lda %>%
group_by(topic) %>%
slice_max(beta, n = 10, with_ties = FALSE) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
group_by(topic, term) %>%
arrange(desc(beta)) %>%
ungroup() %>%
ggplot(aes(beta, term, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
labs(title = "Top 10 terms in each LDA topic",
x = expression(beta), y = NULL) +
facet_wrap(~ topic, ncol = 4, scales = "free")
We can combine our beta and gamma values to understand the topic prevalence in our corpus, and which words contribute to each topic.
td_beta <- tidy(mint_lda)
td_gamma <- tidy(mint_lda, matrix = "gamma")
top_terms <- td_beta %>%
arrange(beta) %>%
group_by(topic) %>%
top_n(10, beta) %>%
arrange(-beta) %>%
select(topic, term) %>%
summarise(terms = list(term)) %>%
mutate(terms = map(terms, paste, collapse = ", ")) %>%
unnest(cols = terms)
gamma_terms <- td_gamma %>%
group_by(topic) %>%
summarise(gamma = mean(gamma)) %>%
arrange(desc(gamma)) %>%
left_join(top_terms, by = "topic") %>%
mutate(topic = paste0("Topic ", topic),
topic = reorder(topic, gamma))
gamma_terms %>%
select(topic, gamma, terms) %>%
kable(digits = 3,
col.names = c("Topic", "Expected topic proportion", "Top 10 terms"))
| Topic | Expected topic proportion | Top 10 terms |
|---|---|---|
| Topic 2 | 0.143 | update, transactions, fix, track, issues, time, budgets, month, helpful, bank |
| Topic 4 | 0.143 | budget, spending, transactions, update, money, track, add, change, card, information |
| Topic 7 | 0.143 | spending, money, credit, user, support, months, information, track, month, link |
| Topic 6 | 0.143 | love, update, time, track, budget, fix, spending, categories, budgeting, version |
| Topic 3 | 0.143 | love, time, credit, bills, update, bank, budget, track, categories, card |
| Topic 1 | 0.143 | transactions, bank, budget, month, time, track, bills, easy, version, support |
| Topic 5 | 0.143 | love, budget, money, easy, bills, credit, add, bill, manually, feature |
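The “expected topic proportion” in this table is the mean of gamma across documents for each topic. All seven values round to 0.143, which is 1/7, so at three decimals the topics appear equally prevalent; that may be worth a closer look. The sketch below uses a small invented gamma matrix (the values are hypothetical) to show how the column is computed.

```r
# Toy document-topic matrix: 4 documents x 3 topics (hypothetical values)
gamma <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.8, 0.1,
                  0.2, 0.3, 0.5,
                  0.4, 0.3, 0.3),
                nrow = 4, byrow = TRUE,
                dimnames = list(NULL, paste("Topic", 1:3)))

# Each row is one document's topic mixture and sums to 1
print(rowSums(gamma))

# Expected topic proportion: the column mean across documents
topic_means <- colMeans(gamma)
print(topic_means)

# With 7 equally prevalent topics, every mean would be 1/7
print(round(1 / 7, 3))
```

Printing more digits in kable(), or inspecting the per-document gamma values directly, would show whether the topics really are evenly spread or whether the rounding hides small differences.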
My first impression after comparing the LDA and STM results was that the LDA topics share more terms with one another, while the STM topics show more variety in their words. That said, I found the terms in the LDA topics easier to interpret, as they were semantically more coherent than those of STM, at least for this dataset. My conclusion is that if we want the terms within each topic to make more sense together, we should go for LDA, but if our goal is to find a hierarchy of importance among topics, we should choose STM. For example, topic 2 in the STM model contains essentially the most important words in the corpus, and topic 4 contains roughly the second most important group.
I would like to categorize the most important topics discussed in the reviews as follows:
Tracking budget and expenses: terms related to this category appear often in the LDA model, for example in topics 2, 7, and 5. We can also see terms relevant to budget tracking in topic 2 of STM.
Category of transactions: words related to this topic appear in topics 3 and 6 of LDA and topic 2 of STM. Mint has a feature that automatically categorizes each transaction you make. I have always found it very useful, so I was curious whether anything associated with it would show up in the topics, and interestingly, it did. This feature is especially helpful when users want to plan their expenses in a specific category such as shopping, entertainment, bills, or health.
Application features, updates, and issues: this category was highly expected to appear as well. Topics 2, 4, 1, and 5 of LDA have terms directly related to the features and issues of Mint. Even though we see the term “fix” in some of these topics, we cannot tell from these models whether users are talking about fixed issues or existing problems. We can also see some words relevant to this subject in topics 4, 6, and 3 of STM.
The last thing I would like to investigate is whether any hidden or apparent pattern in the topics could correlate with the success of the application. By success, I mean that many websites and online magazines such as Forbes, CNBC, and Investopedia have named Mint the best, or one of the best, applications for financial management. Interestingly, in both the LDA and STM models, almost no negative words can be found. For instance, in topics 4, 7, 3, 1, and 5 of LDA, I could not find a single negative word. I think this is consistent with the high popularity of the application. To support this hypothesis, I would need to study the topics for applications with low ratings to see how different they are.
All in all, this was one of the most enjoyable projects I have done for the course so far, and I will most likely analyze the reviews of more applications in the future.