This project explores the plots and categories of recent popular video games listed on IMDB. The data contains 20000 video game titles with their corresponding genres, certificate information, IMDB vote count and plot summary.
The genres are Action, Adventure, Comedy, Crime, Family, Fantasy, Mystery, Sci-fi and Thriller. The plot summary outlines how the story of the game progresses and is an important factor in the popularity of a video game. This project can be useful for gaining insight into trends in game genre popularity. Analysing the text of the plots can also answer questions that help in writing catchy game titles and plots for future sales.
Loading the libraries required in the project
library(ldatuning)
## Warning: package 'ldatuning' was built under R version 4.1.3
library(stm)
## Warning: package 'stm' was built under R version 4.1.3
## stm v1.3.6 successfully loaded. See ?stm for help.
## Papers, resources, and other materials at structuraltopicmodel.com
library(keyATM)
## Warning: package 'keyATM' was built under R version 4.1.3
## keyATM 0.4.1 successfully loaded.
## Papers, examples, resources, and other materials are at
## https://keyatm.github.io/keyATM/
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.1.3
library(topicmodels)
## Warning: package 'topicmodels' was built under R version 4.1.3
library(quanteda)
## Warning: package 'quanteda' was built under R version 4.1.3
## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "packedMatrix" of class "mMatrix"; definition not updated
## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "packedMatrix" of class "replValueSp"; definition not updated
## Package version: 3.2.3
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.9
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.3
## Warning: package 'purrr' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::group_rows() masks kableExtra::group_rows()
## x dplyr::lag() masks stats::lag()
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.1.3
library(seededlda)
## Warning: package 'seededlda' was built under R version 4.1.3
## Loading required package: proxyC
## Warning: package 'proxyC' was built under R version 4.1.3
##
## Attaching package: 'proxyC'
##
## The following object is masked from 'package:stats':
##
## dist
##
##
## Attaching package: 'seededlda'
##
## The following objects are masked from 'package:topicmodels':
##
## terms, topics
##
## The following object is masked from 'package:stats':
##
## terms
library(topicdoc)
## Warning: package 'topicdoc' was built under R version 4.1.3
library(LDAvis)
## Warning: package 'LDAvis' was built under R version 4.1.3
library(quanteda.textplots)
## Warning: package 'quanteda.textplots' was built under R version 4.1.3
library(broom)
## Warning: package 'broom' was built under R version 4.1.3
library(ggplot2)
Limiting the data to 1000 rows, as per the stated requirements, and writing the resulting data to a csv file. The textual data is then read with the readtext function, which directly extracts document-level variables from the data. The columns used for this project are “plot”, which gives the plot summary of the video game and is passed as text_field, and the unique identifier column “X”, which is passed as docid_field.
setwd("C:/CIS8045/data")
vdgames_df = read.csv("imdb-videogames.csv",nrows = 1000)
write.csv(vdgames_df,"C:/CIS8045/data/Assignment2.csv", row.names = FALSE)
vdgames_df <- readtext::readtext("Assignment2.csv",
text_field = "plot", docid_field = "X")
#glimpse(vdgames_df)
kbl(head(vdgames_df, n=15))%>%
kable_paper(bootstrap_options = "striped", full_width=F)
| doc_id | text | name | url | year | certificate | rating | votes | Action | Adventure | Comedy | Crime | Family | Fantasy | Mystery | Sci.Fi | Thriller |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | When a new villain threatens New York City, Peter Parker and Spider-Man’s worlds collide. To save the city and those he loves, he must rise up and be greater. | Spider-Man | https://www.imdb.com/title/tt5807780/?ref_=adv_li_tt | 2018 | T | 9.2 | 20,759 | TRUE | TRUE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE |
| 1 | Amidst the decline of the Wild West at the turn of the 20th century, outlaw Arthur Morgan and his gang struggle to cope with the loss of their way of life. | Red Dead Redemption II | https://www.imdb.com/title/tt6161168/?ref_=adv_li_tt | 2018 | M | 9.7 | 35,703 | TRUE | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 2 | Three very different criminals team up for a series of heists and walk into some of the most thrilling experiences in the corrupt city of Los Santos. | Grand Theft Auto V | https://www.imdb.com/title/tt2103188/?ref_=adv_li_tt | 2013 | M | 9.5 | 59,986 | TRUE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 3 | After wiping out the gods of Mount Olympus, Kratos moves on to the frigid lands of Scandinavia, where he and his son must embark on an odyssey across a dangerous world of gods and monsters. | God of War | https://www.imdb.com/title/tt5838588/?ref_=adv_li_tt | 2018 | M | 9.6 | 26,118 | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 4 | Thrown back into the dangerous underworld he’d tried to leave behind, Nathan Drake must decide what he’s willing to sacrifice to save the ones he loves. | Uncharted 4: A Thief’s End | https://www.imdb.com/title/tt3334704/?ref_=adv_li_tt | 2016 | T | 9.5 | 28,722 | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 5 | Five years after the events of The Last of Us, Ellie embarks on another journey through a post-apocalyptic America on a mission of vengeance against a mysterious militia. | The Last of Us: Part II | https://www.imdb.com/title/tt6298000/?ref_=adv_li_tt | 2020 | M | 8.5 | 30,460 | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 6 | Aloy treks into an arcane region and faces new hostile enemies and threats in search of a way to heal the world from a deadly blight and catastrophic storms. | Horizon Forbidden West | https://www.imdb.com/title/tt12496904/?ref_=adv_li_tt | 2022 | T | 9.2 | 2,979 | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE |
| 7 | In a hostile, post-pandemic world, Joel and Ellie, brought together by desperate circumstances, must rely on each other to survive a brutal journey across what remains of the United States. | The Last of Us | https://www.imdb.com/title/tt2140553/?ref_=adv_li_tt | 2013 | M | 9.7 | 60,590 | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 8 | Take control of three androids in their quest to discover who they really are. | Detroit: Become Human | https://www.imdb.com/title/tt5158314/?ref_=adv_li_tt | 2018 | M | 9.2 | 16,907 | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 9 | Deliveryman Sam Porter must travel across a ravaged wasteland and reconnect the city states of America formed after a mysterious apocalyptic event dubbed ‘death stranding’ left the world in ruins and plagued by supernatural tar creatures. | Death Stranding | https://www.imdb.com/title/tt5807606/?ref_=adv_li_tt | 2019 | M | 8.8 | 8,136 | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 10 | Set in 1274 on the Tsushima Island, the last samurai, Jin Sakai, must master a new fighting style, the way of the Ghost, to defeat the Mongol forces and fight for the freedom and independence of Japan. | Ghost of Tsushima | https://www.imdb.com/title/tt7651352/?ref_=adv_li_tt | 2020 | M | 9.3 | 8,452 | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 11 | In this sequel of Marvel’s Spider-Man (2018), you can play as Miles Morales as a new and different Spider-Man while he learns some stories about his will of fighting crime and serving justice by his mentor and former hero, Peter Parker. | Spider-Man: Miles Morales | https://www.imdb.com/title/tt12496734/?ref_=adv_li_tt | 2020 | T | 8.5 | 5,835 | TRUE | TRUE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE |
| 12 | In Night City, a mercenary known as V navigates a dystopian society in which the line between humanity and technology becomes blurred. | Cyberpunk 2077 | https://www.imdb.com/title/tt3810192/?ref_=adv_li_tt | 2020 | M | 8.0 | 8,118 | TRUE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 13 | Treasure hunter Nathan Drake, embarks in the adventure of his life searching for the legendary treasure, El Dorado while fighting a group of mercenaries. | Uncharted: Drake’s Fortune | https://www.imdb.com/title/tt1000777/?ref_=adv_li_tt | 2007 | T | 8.5 | 20,343 | TRUE | TRUE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE |
| 14 | With his back against the wall, Batman turns to his closest allies to help him save Gotham City from the clutches of Scarecrow and the Arkham Knight’s army. A familiar face also returns to give The Dark Knight a message he cannot ignore. | Batman: Arkham Knight | https://www.imdb.com/title/tt3554580/?ref_=adv_li_tt | 2015 | M | 9.0 | 18,970 | TRUE | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
Using the quanteda package to create a corpus from the above data. Next, I create tokens and a document-feature matrix. The resulting corpus is a static container of the texts from this video games data. The corpus itself is not modified by cleaning or pre-processing steps such as stemming. Rather, the corpus vdgames_corp serves as a reference copy from which new objects are extracted for the required analysis on this video games data.
To segment the data from the corpus, I tokenize the texts and tidy up the tokens. I also define a vector of custom stop words that are frequent in the data but not relevant to the final result. vdgames_dfm is the resulting document-feature matrix.
vdgames_corp <- corpus(vdgames_df)
summary(vdgames_corp, n = 5)
## Corpus consisting of 1000 documents, showing 5 documents:
##
## Text Types Tokens Sentences name
## 0 28 33 2 Spider-Man
## 1 26 33 1 Red Dead Redemption II
## 2 25 28 1 Grand Theft Auto V
## 3 31 38 1 God of War
## 4 25 28 1 Uncharted 4: A Thief's End
## url year certificate rating
## https://www.imdb.com/title/tt5807780/?ref_=adv_li_tt 2018 T 9.2
## https://www.imdb.com/title/tt6161168/?ref_=adv_li_tt 2018 M 9.7
## https://www.imdb.com/title/tt2103188/?ref_=adv_li_tt 2013 M 9.5
## https://www.imdb.com/title/tt5838588/?ref_=adv_li_tt 2018 M 9.6
## https://www.imdb.com/title/tt3334704/?ref_=adv_li_tt 2016 T 9.5
## votes Action Adventure Comedy Crime Family Fantasy Mystery Sci.Fi Thriller
## 20,759 TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## 35,703 TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## 59,986 TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## 26,118 TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 28,722 TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
vdgames_toks <- tokens(
vdgames_corp,
remove_punct = TRUE,
remove_numbers = TRUE,
remove_symbols = TRUE,
remove_url = TRUE,
split_hyphens = FALSE)
myStopWords = c("he", "when", "and", "in", "the", "by","to", "his", "is", "must",
"over", "out", "during", "of", "can", "through", "so", "where","set",
"based", "over", "than", "this", "new", "city", "fight", "save",
"three", "world", "find", "take", "known","one", "help", "called",
"summary", "story", "first", "stop", "worlds","game","player","play",
"team","see","plot","players","follows","characters","help","two","â", "friends", "series", "video")
vdgames_toks2 <- tokens_remove(
vdgames_toks, pattern = c(stopwords("en"), myStopWords))
vdgames_dfm <- dfm(vdgames_toks2, tolower = TRUE) %>%
dfm_remove(c(stopwords("en"), myStopWords)) %>%
dfm_trim(min_termfreq = 5, min_docfreq = 10)
vdgames_dfm
## Document-feature matrix of: 1,000 documents, 248 features (98.30% sparse) and 15 docvars.
## features
## docs york rise century gang way life different criminals lands son
## 0 1 1 0 0 0 0 0 0 0 0
## 1 0 0 1 1 1 1 0 0 0 0
## 2 0 0 0 0 0 0 1 1 0 0
## 3 0 0 0 0 0 0 0 0 1 1
## 4 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0
## [ reached max_ndoc ... 994 more documents, reached max_nfeat ... 238 more features ]
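The trimming step keeps a feature only if it clears both thresholds: a minimum total count across the corpus (min_termfreq) and a minimum number of documents containing it (min_docfreq). The following is a minimal base-R sketch of that rule on a toy count matrix. The counts and the thresholds of 2 are made up for illustration and are not taken from the video games data.

```r
# Toy document-term counts: 4 documents x 3 terms (hypothetical values).
counts <- matrix(
  c(2, 0, 1,
    1, 0, 0,
    3, 1, 0,
    0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("war", "quest", "planet"))
)

term_freq <- colSums(counts)      # total occurrences of each term
doc_freq  <- colSums(counts > 0)  # number of documents containing each term

# Keep terms that meet both toy thresholds (the project uses 5 and 10).
keep    <- term_freq >= 2 & doc_freq >= 2
trimmed <- counts[, keep, drop = FALSE]
colnames(trimmed)
```

Only "war" survives here. Documents that lose all of their terms end up as all-zero rows, like doc4 in this toy matrix, which is similar to the empty documents that some later steps report or drop.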
Getting the number of documents and features through ndoc() and nfeat() and selecting the top features with topfeatures(). Analyzing some relevant keywords using the function kwic() to show those keywords in context, with their neighbouring words. For my analysis, the keywords are “quest”, “mysterious” and “planet”. Note that the “planet” search runs on the unfiltered tokens vdgames_toks, so stop words appear in its context window.
ndoc(vdgames_dfm)
## [1] 1000
nfeat(vdgames_dfm)
## [1] 248
topfeatures(vdgames_dfm, 30)
## war years time full mysterious evil way
## 79 55 54 50 49 45 44
## back dark order young events battle power
## 44 43 42 42 41 41 38
## group across control journey life island forces
## 37 35 35 34 33 31 31
## adventure land takes quest planet powerful empire
## 31 30 30 29 29 28 27
## now earth
## 27 27
keyword1 <- kwic(vdgames_toks2, pattern = phrase("quest*"), window = 2)
head(keyword1, 5)
## Keyword-in-context with 5 matches.
## [8, 3] control androids | quest | discover really
## [27, 7] Drake goes | quest | Marco Polo's
## [31, 14] warrior goes | quest | learn truth
## [59, 6] Drake embarks | quest | search Atlantis
## [148, 16] Tony survives | quest | kill Sosa
keyword2 <- kwic(vdgames_toks2, pattern = phrase("mysterious*"), window = 2)
head(keyword2, 5)
## Keyword-in-context with 5 matches.
## [5, 14] mission vengeance | mysterious | militia
## [9, 12] America formed | mysterious | apocalyptic event
## [31, 17] learn truth | mysterious | origin state
## [48, 14] ultimately leads | mysterious | village
## [72, 14] GrabPack Explore | mysterious | facility get
keyword3 <- kwic(vdgames_toks, pattern = phrase("planet*"), window = 2)
head(keyword3, 5)
## Keyword-in-context with 5 matches.
## [34, 31] protect the | planet | and all
## [109, 15] beyond the | planet | Pandora and
## [146, 17] in the | planets | maximum security
## [169, 15] rid the | planet | of invading
## [170, 6] enters the | planet's | orbit and
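To make the keyword-in-context idea concrete, here is a minimal base-R sketch for a single tokenized document. The helper kwic_sketch is hypothetical and only does exact, case-insensitive matching, so it is a simplification of kwic(), which also supports glob patterns such as "quest*". The toy tokens mirror the first “quest” match shown above.

```r
# Minimal keyword-in-context for one tokenized document (base R only).
# kwic_sketch is a hypothetical helper, not part of quanteda.
kwic_sketch <- function(tokens, keyword, window = 2) {
  idx  <- seq_along(tokens)
  hits <- which(tolower(tokens) == tolower(keyword))
  lapply(hits, function(i) {
    list(
      pre     = tokens[idx >= i - window & idx < i],   # up to `window` tokens before
      keyword = tokens[i],
      post    = tokens[idx > i & idx <= i + window]    # up to `window` tokens after
    )
  })
}

toks <- c("control", "androids", "quest", "discover", "really")
res  <- kwic_sketch(toks, "quest", window = 2)
res[[1]]
```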
The feature co-occurrence matrix (FCM) records the number of co-occurrences of tokens. I keep the words that occur at least 35 times in the data; there are 17 such words, and I visualize a semantic network of them using textplot_network.
vdgames_dfm_small <- dfm_trim(vdgames_dfm, min_termfreq = 35)
nfeat(vdgames_dfm_small)
## [1] 17
# Build the co-occurrence matrix from the frequent features
vdgames_fcm <- fcm(vdgames_dfm_small)
feat <- names(topfeatures(vdgames_fcm, 30))
fcmat_select <- fcm_select(vdgames_fcm, pattern = feat, selection = "keep")
# Size each vertex by the log of its term frequency in the dfm
size <- log(colSums(dfm_select(vdgames_dfm_small, feat, selection = "keep")))
set.seed(123)
textplot_network(fcmat_select, min_freq = 0.5, edge_size = 2, edge_color="darkseagreen",
vertex_size = size/max(size)*3)
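As a rough illustration of what a co-occurrence matrix holds, the sketch below counts, for each pair of terms, the number of documents in which both appear. It uses a toy binary matrix with made-up entries, and this document-level Boolean counting is an assumption chosen for simplicity; fcm() has its own counting options, so its numbers need not match.

```r
# Toy binary document-term matrix: 1 if the term occurs in the document.
occurs <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("war", "battle", "evil"))
)

# crossprod(occurs) is t(occurs) %*% occurs: entry [i, j] counts the
# documents that contain both term i and term j.
cooc <- crossprod(occurs)
diag(cooc) <- 0  # ignore a term's co-occurrence with itself
cooc
```

In the network plot, pairs with larger counts correspond to stronger edges between the two words.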
Using Latent Dirichlet Allocation (LDA) for topic modelling to organize the above data into 5 themes. The table lists the top 5 keywords of the first two topics, and the plot visualizes the top 5 keywords in each topic.
vdgames_dtmat = quanteda::convert(vdgames_dfm, to="topicmodels")
vdgames_lda5 <- LDA(vdgames_dtmat, k = 5, control = list(seed = 123))
vdgames_lda5_betas <- broom::tidy(vdgames_lda5)
top_terms_in_topics <- vdgames_lda5_betas %>%
group_by(topic) %>%
top_n(5, beta) %>%
ungroup() %>%
arrange(topic, -beta)
#top_terms_in_topics
kbl(head(top_terms_in_topics, n=10))%>%
kable_paper(bootstrap_options = "striped", full_width=F)
| topic | term | beta |
|---|---|---|
| 1 | mysterious | 0.0258181 |
| 1 | time | 0.0255892 |
| 1 | young | 0.0212622 |
| 1 | battle | 0.0175506 |
| 1 | evil | 0.0166833 |
| 2 | years | 0.0238485 |
| 2 | power | 0.0219938 |
| 2 | dark | 0.0218196 |
| 2 | empire | 0.0181477 |
| 2 | across | 0.0169756 |
top_terms_in_topics %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
scale_fill_brewer(palette="YlGn")+
facet_wrap(~ topic, scales = "free") +
coord_flip()
test = subset(vdgames_df)
nrow(test)
## [1] 1000
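To check that the LDA model generalizes, its performance should be verified on unseen data through cross-validation, which also helps identify the best number of topics. Note that corpus_subset() is called below without a subsetting condition, so the training and test matrices are both built from the full corpus and the perplexity reported later is measured on the training data rather than on held-out data.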
train_vdgames_dtmat <- corpus_subset(vdgames_corp) %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE,
remove_symbols = TRUE, remove_url = TRUE) %>%
dfm(tolower = TRUE) %>%
dfm_remove(c(stopwords("en"), myStopWords)) %>%
dfm_trim(min_termfreq = 5, min_docfreq = 10) %>%
quanteda::convert(to="topicmodels")
test_vdgames_dtmat <- corpus_subset(vdgames_corp) %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE,
remove_symbols = TRUE, remove_url = TRUE) %>%
dfm(tolower = TRUE) %>%
dfm_remove(c(stopwords("en"), myStopWords)) %>%
dfm_trim(min_termfreq = 5, min_docfreq = 10) %>%
quanteda::convert(to="topicmodels")
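Because corpus_subset() is called above without a condition, both matrices are built from the same documents. A minimal sketch of a genuine held-out split is given below. The 80/20 share and the names test_idx and is_test are assumptions for illustration; the commented lines show how the flag could be passed to corpus_subset() and are not run here.

```r
set.seed(123)                                # reproducible split
n_docs   <- 1000                             # documents in the corpus
test_idx <- sample(n_docs, size = 200)       # hold out 20% (assumed share)
is_test  <- seq_len(n_docs) %in% test_idx    # one logical flag per document

# Assumed usage with the corpus defined earlier (not run here):
# train_corp <- corpus_subset(vdgames_corp, !is_test)
# test_corp  <- corpus_subset(vdgames_corp,  is_test)

c(train = sum(!is_test), test = sum(is_test))
```

With such a split, the test matrix would also need the same features as the training matrix before calling perplexity(), for example by aligning the two with quanteda's dfm_match().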
Perplexity is a common metric for evaluating language models, and a lower perplexity indicates a better fit. Here, I compute the perplexity for k=5, then repeat the computation for k=2 to 5 and plot the result.
train_vdgames_lda5 <- LDA(train_vdgames_dtmat, k = 5, control = list(seed = 123))
perplexity(train_vdgames_lda5, test_vdgames_dtmat)
## [1] 223.7036
n_topics_vec = 2:5
perplexity_vec = map_dbl(n_topics_vec, function(kk) {
message(kk)
train_vdgames_ldaK <- LDA(train_vdgames_dtmat, k = kk, control = list(seed = 123))
perp = perplexity(train_vdgames_ldaK, test_vdgames_dtmat)
})
## 2
## 3
## 4
## 5
lda_perplexity_result = tibble(
n_topics = n_topics_vec, perplexity = perplexity_vec
)
plot(lda_perplexity_result, type="l")
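Perplexity is the exponential of the negative average log-likelihood per token, so a model that assigns higher probability to the scored text gets a lower value. The sketch below shows the arithmetic for a toy unigram model with made-up probabilities. It only illustrates the definition and is not how perplexity() evaluates an LDA model internally.

```r
# Toy "model": a probability for each word in a tiny vocabulary (made-up values).
word_prob <- c(war = 0.5, quest = 0.25, planet = 0.25)

# Tokens to score.
held_out <- c("war", "war", "quest", "planet")

log_lik <- sum(log(word_prob[held_out]))      # total log-likelihood of the tokens
perp    <- exp(-log_lik / length(held_out))   # exp of the negative mean log-likelihood
perp
```

A model that spreads probability uniformly over V words has perplexity V, which gives a sense of scale for the values plotted above.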
Using ldatuning to find the best number of topics based on the “CaoJuan2009”,“Arun2010”, and “Deveaud2014” measures as per requirement.
lda_ldatuning_result <- FindTopicsNumber(
vdgames_dtmat, topics = n_topics_vec,
metrics = c("CaoJuan2009", "Arun2010", "Deveaud2014"), method = "VEM",
control = list(seed = 123), mc.cores = 4L, verbose = TRUE)
## fit models... done.
## calculate metrics:
## CaoJuan2009... done.
## Arun2010... done.
## Deveaud2014... done.
FindTopicsNumber_plot(lda_ldatuning_result)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
The best k is chosen by a vote between the metrics. CaoJuan2009 and Arun2010 are minimizers, so lower values are better. Arun2010 has its lowest value at k=5 and CaoJuan2009 has its lowest value at k=2. Deveaud2014 is a maximizer, so a higher value is better, and its highest value occurs at k=5. Two of the three metrics therefore favour k=5, so the graph above suggests that the 5-topic LDA performs best. Note that the code below fits a 3-topic model (vdgames_lda3), so the topic-specific diagnostics from the topicdoc package are shown for k=3.
vdgames_lda3 <- LDA(vdgames_dtmat, k = 3, control = list(seed = 123))
topicdoc_result = topic_diagnostics(vdgames_lda3, vdgames_dtmat)
#view(topicdoc_result)
kbl(head(topicdoc_result, n=10))%>%
kable_paper(bootstrap_options = "striped", full_width=F)
| topic_num | topic_size | mean_token_length | dist_from_corpus | tf_df_dist | doc_prominence | topic_coherence | topic_exclusivity |
|---|---|---|---|---|---|---|---|
| 1 | 82.55511 | 5.3 | 0.2354626 | 0.5653146 | 985 | -210.2222 | 9.350212 |
| 2 | 85.13868 | 5.5 | 0.2406158 | 0.3303270 | 985 | -216.2282 | 8.981297 |
| 3 | 80.30621 | 5.2 | 0.2299487 | 0.3914156 | 985 | -217.5405 | 9.398013 |
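Returning to the ldatuning metrics, the selection rule (minimise CaoJuan2009 and Arun2010, maximise Deveaud2014, then take the most-voted k) can be sketched in base R. The data frame below assumes the shape of the FindTopicsNumber() result, with a topics column and one column per metric. Its numbers are invented for illustration and are not the project's results.

```r
# Hypothetical metric table shaped like the FindTopicsNumber() output.
# The values are invented for illustration only.
tuning <- data.frame(
  topics      = 2:5,
  CaoJuan2009 = c(0.10, 0.14, 0.13, 0.12),
  Arun2010    = c(9.0, 8.2, 7.9, 7.5),
  Deveaud2014 = c(1.1, 1.3, 1.4, 1.6)
)

best_k <- c(
  CaoJuan2009 = tuning$topics[which.min(tuning$CaoJuan2009)],  # minimise
  Arun2010    = tuning$topics[which.min(tuning$Arun2010)],     # minimise
  Deveaud2014 = tuning$topics[which.max(tuning$Deveaud2014)]   # maximise
)

votes  <- table(best_k)                                # metrics preferring each k
winner <- as.integer(names(votes)[which.max(votes)])   # most-voted k
winner
```

When every metric prefers a different k, which.max() simply returns the first candidate, so ties would still need a judgment call.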
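Using a structural topic model (STM), which can improve inference and qualitative interpretability by letting document metadata affect topical prevalence, topic content, or both. The model below is fitted on the above video games data with K=3 and without covariates. Exploring the result with the LDAvis package.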
stm_vdgamesdfmat <- quanteda::convert(vdgames_dfm, to = "stm")
## Warning in dfm2stm(x, docvars, omit_empty = TRUE): Dropped empty document(s):
## 23, 213, 226, 310, 376, 439, 454, 635, 731, 780, 787, 825, 854, 935, 959
out <- prepDocuments( stm_vdgamesdfmat$documents,
stm_vdgamesdfmat$vocab,
stm_vdgamesdfmat$meta)
vdgames_tmob_stm <- stm(out$documents, out$vocab,K=3,
seed=123,emtol=1e-3, max.em.its=150)
## Beginning Spectral Initialization
## Calculating the gram matrix...
## Finding anchor words...
## ...
## Recovering initialization...
## ..
## Initialization complete.
## .............................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 1 (approx. per word bound = -5.661)
## .............................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 2 (approx. per word bound = -5.518, relative change = 2.539e-02)
## .............................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 3 (approx. per word bound = -5.466, relative change = 9.440e-03)
## .............................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 4 (approx. per word bound = -5.439, relative change = 4.850e-03)
## .............................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 5 (approx. per word bound = -5.424, relative change = 2.723e-03)
## Topic 1: years, way, back, dark, battle
## Topic 2: young, group, journey, island, power
## Topic 3: war, time, evil, events, life
## .............................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 6 (approx. per word bound = -5.415, relative change = 1.658e-03)
## .............................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 7 (approx. per word bound = -5.409, relative change = 1.080e-03)
## .............................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Model Converged
toLDAvis(mod=vdgames_tmob_stm, docs=out$documents)
## Loading required namespace: servr
plot(vdgames_tmob_stm, type="summary", n=5)
topicQuality(vdgames_tmob_stm, out$documents)
## [1] -172.9114 -228.3673 -198.8188
## [1] 8.885934 8.680275 8.422684
keyATM_docs <- keyATM_read(texts = vdgames_dfm)
## Using quanteda dfm.
## Warning in get_doc_index(W_raw, check = TRUE): Number of documents with 0 length: 15
## This may cause invalid covariates or time index.
## Please review the preprocessing steps.
## Document index to check: 24, 214, 227, 311, 377, 440, 455, 636, 732, 781, 788, 826, 855, 936, 960
summary(keyATM_docs)
## keyATM_docs object of: 1000 documents.
## Length of documents:
## Avg: 4.363
## Min: 0
## Max: 12
## SD: 2.322
## Number of unique words: 2123
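Comparing topic qualities from the topicQuality() output (semantic coherence on the first line, exclusivity on the second), Topic 2 has medium exclusivity (8.68) but the lowest semantic coherence (-228.37), while Topic 1 scores highest on both (-172.91 and 8.89). In the top words printed during fitting, “battle” appears under Topic 1 and “war” under Topic 3. Together with related words like “army” and “military”, this suggests that the most popular video games are Action-based, with plots centered around fighting, combat, battles and similar themes.
Visualizing the keywords associated with 3 keyword topics and fitting a keyATM base model. The model uses 3 keyword topics with 5 keywords each plus 3 additional topics without keywords, and the top 5 words of each topic are shown.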
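Semantic coherence values are negative, and values closer to zero mean that a topic's top words tend to appear in the same documents. The sketch below follows a common co-document-frequency definition (Mimno et al., 2011) on a toy binary matrix. The function semantic_coherence and the matrix entries are hypothetical, and the result is not meant to reproduce the numbers printed by topicQuality().

```r
# Toy binary document-term matrix: rows are documents, columns are a topic's
# top words ordered from most to least probable (made-up entries).
occurs <- matrix(
  c(1, 1, 0,
    1, 1, 1,
    1, 0, 0,
    0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("war", "battle", "army"))
)

# Hypothetical helper: sum over word pairs of log((D(m, l) + 1) / D(l)),
# where D counts documents containing the word(s).
semantic_coherence <- function(occurs) {
  co_doc   <- crossprod(occurs)  # co_doc[i, j] = documents containing both words
  doc_freq <- diag(co_doc)       # documents containing each word
  score <- 0
  for (m in 2:ncol(occurs)) {
    for (l in 1:(m - 1)) {
      score <- score + log((co_doc[m, l] + 1) / doc_freq[l])
    }
  }
  score
}

coh <- semantic_coherence(occurs)
coh
```

Pairs of words that rarely share a document pull the score further below zero, which is why a less negative value indicates a more coherent topic.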
vdgames_key_list = list(
dark = c("young", "power", "journey", "land", "earth"),
war = c("years","order","battle", "way","group"),
mysterious=c("time","evil","back","control","island")
)
vdgames_key_viz <- visualize_keywords(docs = keyATM_docs, keywords = vdgames_key_list)
vdgames_key_viz
vdgames_tmod_keyatm_base <- keyATM(
docs = keyATM_docs,
no_keyword_topics = 3,
keywords = vdgames_key_list,
model = "base",
options = list(seed = 123))
## Warning in keyATM_fit(docs, model, no_keyword_topics, keywords,
## model_settings, : Some documents have 0 length. Please review the preprocessing
## steps.
## Initializing the model...
## Fitting the model. 1500 iterations...
## Creating an output object. It may take time...
top_words(vdgames_tmod_keyatm_base, 5)
## 1_dark 2_war 3_mysterious Other_1 Other_2
## 1 young [✓] war time [✓] duty add
## 2 land [✓] years [✓] back [✓] call york
## 3 power [✓] mysterious evil [✓] star rise
## 4 king full control [✓] wars century
## 5 journey [✓] way [✓] mario battle [2] gang
## Other_3
## 1 marvel
## 2 universe
## 3 heroes
## 4 villains
## 5 order [2]
kable(top_words(vdgames_tmod_keyatm_base, 5), caption = "Top 5 keywords")
| 1_dark | 2_war | 3_mysterious | Other_1 | Other_2 | Other_3 |
|---|---|---|---|---|---|
| young [✓] | war | time [✓] | duty | add | marvel |
| land [✓] | years [✓] | back [✓] | call | york | universe |
| power [✓] | mysterious | evil [✓] | star | rise | heroes |
| king | full | control [✓] | wars | century | villains |
| journey [✓] | way [✓] | mario | battle [2] | gang | order [2] |
Thus, this project shows how the descriptions and plots of video games can be analysed to find which themes are more popular. A similar exploratory text analysis can also be applied to other products.