Assignment2-Misra-VideoGamesPlots

Plots of Latest Popular Video Games: Text Data Analysis

This project uses data on recent popular video games from IMDb to explore their plots and categories. The dataset contains 20,000 video game titles with their corresponding genres, certificate information, IMDb vote counts and plot summaries.

The genres are Action, Adventure, Comedy, Crime, Family, Fantasy, Mystery, Sci-Fi and Thriller. The plot summary outlines how the story of the game progresses and is an important factor in the popularity of a video game. Analyzing the text of the plots can give insight into which genres and themes are popular, and the results may help in writing catchy titles and plots for future games.

Project Coding

Loading the libraries required in the project

library(ldatuning)

## Warning: package 'ldatuning' was built under R version 4.1.3

library(stm)

## Warning: package 'stm' was built under R version 4.1.3

## stm v1.3.6 successfully loaded. See ?stm for help. 
##  Papers, resources, and other materials at structuraltopicmodel.com

library(keyATM)

## Warning: package 'keyATM' was built under R version 4.1.3

## keyATM 0.4.1 successfully loaded.
##  Papers, examples, resources, and other materials are at
##  https://keyatm.github.io/keyATM/

library(kableExtra)

## Warning: package 'kableExtra' was built under R version 4.1.3

library(topicmodels)

## Warning: package 'topicmodels' was built under R version 4.1.3

library(quanteda)

## Warning: package 'quanteda' was built under R version 4.1.3

## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "packedMatrix" of class "mMatrix"; definition not updated

## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "packedMatrix" of class "replValueSp"; definition not updated

## Package version: 3.2.3
## Unicode version: 13.0
## ICU version: 69.1

## Parallel computing: 8 of 8 threads used.

## See https://quanteda.io for tutorials and examples.

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.1.3

## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --

## v ggplot2 3.3.6     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.9
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## Warning: package 'ggplot2' was built under R version 4.1.3

## Warning: package 'purrr' was built under R version 4.1.3

## Warning: package 'dplyr' was built under R version 4.1.3

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter()     masks stats::filter()
## x dplyr::group_rows() masks kableExtra::group_rows()
## x dplyr::lag()        masks stats::lag()

library(tidytext)

## Warning: package 'tidytext' was built under R version 4.1.3

library(seededlda)

## Warning: package 'seededlda' was built under R version 4.1.3

## Loading required package: proxyC

## Warning: package 'proxyC' was built under R version 4.1.3

## 
## Attaching package: 'proxyC'
## 
## The following object is masked from 'package:stats':
## 
##     dist
## 
## 
## Attaching package: 'seededlda'
## 
## The following objects are masked from 'package:topicmodels':
## 
##     terms, topics
## 
## The following object is masked from 'package:stats':
## 
##     terms

library(topicdoc)

## Warning: package 'topicdoc' was built under R version 4.1.3

library(LDAvis)

## Warning: package 'LDAvis' was built under R version 4.1.3

library(quanteda.textplots)

## Warning: package 'quanteda.textplots' was built under R version 4.1.3

library(broom)

## Warning: package 'broom' was built under R version 4.1.3

library(ggplot2)

Reading the text from the data

The data is limited to the first 1,000 rows, as per the stated requirements, and the result is written to a CSV file. The text is then read with the readtext function, which extracts the document-level variables directly from the data. The columns chosen for this project are “plot”, which holds the plot summary of each video game, and “X”, a unique identifier used as the docid field.

setwd("C:/CIS8045/data")
vdgames_df = read.csv("imdb-videogames.csv",nrows = 1000)

write.csv(vdgames_df,"C:/CIS8045/data/Assignment2.csv", row.names = FALSE)

vdgames_df <- readtext::readtext("Assignment2.csv", 
                                 text_field = "plot", docid_field = "X")
#glimpse(vdgames_df)
kbl(head(vdgames_df, n=15))%>%
  kable_paper(bootstrap_options = "striped", full_width=F)

doc_id	text	name	url	year	certificate	rating	votes	Action	Adventure	Comedy	Crime	Family	Fantasy	Mystery	Sci.Fi	Thriller
0	When a new villain threatens New York City, Peter Parker and Spider-Man’s worlds collide. To save the city and those he loves, he must rise up and be greater.	Spider-Man	https://www.imdb.com/title/tt5807780/?ref_=adv_li_tt	2018	T	9.2	20,759	TRUE	TRUE	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE
1	Amidst the decline of the Wild West at the turn of the 20th century, outlaw Arthur Morgan and his gang struggle to cope with the loss of their way of life.	Red Dead Redemption II	https://www.imdb.com/title/tt6161168/?ref_=adv_li_tt	2018	M	9.7	35,703	TRUE	TRUE	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE
2	Three very different criminals team up for a series of heists and walk into some of the most thrilling experiences in the corrupt city of Los Santos.	Grand Theft Auto V	https://www.imdb.com/title/tt2103188/?ref_=adv_li_tt	2013	M	9.5	59,986	TRUE	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE
3	After wiping out the gods of Mount Olympus, Kratos moves on to the frigid lands of Scandinavia, where he and his son must embark on an odyssey across a dangerous world of gods and monsters.	God of War	https://www.imdb.com/title/tt5838588/?ref_=adv_li_tt	2018	M	9.6	26,118	TRUE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE
4	Thrown back into the dangerous underworld he’d tried to leave behind, Nathan Drake must decide what he’s willing to sacrifice to save the ones he loves.	Uncharted 4: A Thief’s End	https://www.imdb.com/title/tt3334704/?ref_=adv_li_tt	2016	T	9.5	28,722	TRUE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE
5	Five years after the events of The Last of Us, Ellie embarks on another journey through a post-apocalyptic America on a mission of vengeance against a mysterious militia.	The Last of Us: Part II	https://www.imdb.com/title/tt6298000/?ref_=adv_li_tt	2020	M	8.5	30,460	TRUE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE
6	Aloy treks into an arcane region and faces new hostile enemies and threats in search of a way to heal the world from a deadly blight and catastrophic storms.	Horizon Forbidden West	https://www.imdb.com/title/tt12496904/?ref_=adv_li_tt	2022	T	9.2	2,979	TRUE	TRUE	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE
7	In a hostile, post-pandemic world, Joel and Ellie, brought together by desperate circumstances, must rely on each other to survive a brutal journey across what remains of the United States.	The Last of Us	https://www.imdb.com/title/tt2140553/?ref_=adv_li_tt	2013	M	9.7	60,590	TRUE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE
8	Take control of three androids in their quest to discover who they really are.	Detroit: Become Human	https://www.imdb.com/title/tt5158314/?ref_=adv_li_tt	2018	M	9.2	16,907	TRUE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE
9	Deliveryman Sam Porter must travel across a ravaged wasteland and reconnect the city states of America formed after a mysterious apocalyptic event dubbed ‘death stranding’ left the world in ruins and plagued by supernatural tar creatures.	Death Stranding	https://www.imdb.com/title/tt5807606/?ref_=adv_li_tt	2019	M	8.8	8,136	TRUE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE
10	Set in 1274 on the Tsushima Island, the last samurai, Jin Sakai, must master a new fighting style, the way of the Ghost, to defeat the Mongol forces and fight for the freedom and independence of Japan.	Ghost of Tsushima	https://www.imdb.com/title/tt7651352/?ref_=adv_li_tt	2020	M	9.3	8,452	TRUE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE
11	In this sequel of Marvel’s Spider-Man (2018), you can play as Miles Morales as a new and different Spider-Man while he learns some stories about his will of fighting crime and serving justice by his mentor and former hero, Peter Parker.	Spider-Man: Miles Morales	https://www.imdb.com/title/tt12496734/?ref_=adv_li_tt	2020	T	8.5	5,835	TRUE	TRUE	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE
12	In Night City, a mercenary known as V navigates a dystopian society in which the line between humanity and technology becomes blurred.	Cyberpunk 2077	https://www.imdb.com/title/tt3810192/?ref_=adv_li_tt	2020	M	8.0	8,118	TRUE	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE
13	Treasure hunter Nathan Drake, embarks in the adventure of his life searching for the legendary treasure, El Dorado while fighting a group of mercenaries.	Uncharted: Drake’s Fortune	https://www.imdb.com/title/tt1000777/?ref_=adv_li_tt	2007	T	8.5	20,343	TRUE	TRUE	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE
14	With his back against the wall, Batman turns to his closest allies to help him save Gotham City from the clutches of Scarecrow and the Arkham Knight’s army. A familiar face also returns to give The Dark Knight a message he cannot ignore.	Batman: Arkham Knight	https://www.imdb.com/title/tt3554580/?ref_=adv_li_tt	2015	M	9.0	18,970	TRUE	TRUE	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE

Tokens and Feature Matrix

The quanteda package is used to create a corpus from the above data, followed by tokens and a document-feature matrix. The corpus is a static container of the texts from the video games data: it is not itself cleaned or pre-processed (for example by stemming). Instead, the corpus vdgames_corp serves as a reference copy from which new objects are extracted for the analysis.

To segment the texts, I tokenize the corpus and tidy up the tokens. I also define a vector of additional stop words that recur in the data but are not relevant to the final result. vdgames_dfm is the resulting document-feature matrix.

vdgames_corp <- corpus(vdgames_df)
summary(vdgames_corp, n = 5)

## Corpus consisting of 1000 documents, showing 5 documents:
## 
##  Text Types Tokens Sentences                       name
##     0    28     33         2                 Spider-Man
##     1    26     33         1     Red Dead Redemption II
##     2    25     28         1         Grand Theft Auto V
##     3    31     38         1                 God of War
##     4    25     28         1 Uncharted 4: A Thief's End
##                                                   url year certificate rating
##  https://www.imdb.com/title/tt5807780/?ref_=adv_li_tt 2018           T    9.2
##  https://www.imdb.com/title/tt6161168/?ref_=adv_li_tt 2018           M    9.7
##  https://www.imdb.com/title/tt2103188/?ref_=adv_li_tt 2013           M    9.5
##  https://www.imdb.com/title/tt5838588/?ref_=adv_li_tt 2018           M    9.6
##  https://www.imdb.com/title/tt3334704/?ref_=adv_li_tt 2016           T    9.5
##   votes Action Adventure Comedy Crime Family Fantasy Mystery Sci.Fi Thriller
##  20,759   TRUE      TRUE  FALSE FALSE  FALSE    TRUE   FALSE  FALSE    FALSE
##  35,703   TRUE      TRUE  FALSE  TRUE  FALSE   FALSE   FALSE  FALSE    FALSE
##  59,986   TRUE     FALSE  FALSE  TRUE  FALSE   FALSE   FALSE  FALSE    FALSE
##  26,118   TRUE      TRUE  FALSE FALSE  FALSE   FALSE   FALSE  FALSE    FALSE
##  28,722   TRUE      TRUE  FALSE FALSE  FALSE   FALSE   FALSE  FALSE    FALSE

vdgames_toks <- tokens(
  vdgames_corp,
  remove_punct = TRUE,
  remove_numbers = TRUE,
  remove_symbols = TRUE,
  remove_url = TRUE,
  split_hyphens = FALSE)


myStopWords = c("he", "when", "and", "in", "the", "by", "to", "his", "is", "must",
                "over", "out", "during", "of", "can", "through", "so", "where", "set",
                "based", "than", "this", "new", "city", "fight", "save",
                "three", "world", "find", "take", "known", "one", "help", "called",
                "summary", "story", "first", "stop", "worlds", "game", "player", "play",
                "team", "see", "plot", "players", "follows", "characters", "two", "â",
                "friends", "series", "video")

vdgames_toks2 <- tokens_remove(
  vdgames_toks, pattern = c(stopwords("en"), myStopWords))

vdgames_dfm <- dfm(vdgames_toks2, tolower = TRUE) %>%
  dfm_remove(c(stopwords("en"), myStopWords)) %>%
  dfm_trim(min_termfreq = 5, min_docfreq = 10)

vdgames_dfm

## Document-feature matrix of: 1,000 documents, 248 features (98.30% sparse) and 15 docvars.
##     features
## docs york rise century gang way life different criminals lands son
##    0    1    1       0    0   0    0         0         0     0   0
##    1    0    0       1    1   1    1         0         0     0   0
##    2    0    0       0    0   0    0         1         1     0   0
##    3    0    0       0    0   0    0         0         0     1   1
##    4    0    0       0    0   0    0         0         0     0   0
##    5    0    0       0    0   0    0         0         0     0   0
## [ reached max_ndoc ... 994 more documents, reached max_nfeat ... 238 more features ]
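To make the idea of a document-feature matrix concrete, here is a minimal base R sketch on two made-up plot summaries. The texts, the short stop-word list and the names texts, stop_words and dfm_toy are illustrative assumptions and are not taken from the dataset; quanteda's tokens() and dfm() handle many more cases than this.

```r
# Toy plot summaries (hypothetical, not taken from the dataset)
texts <- c(d1 = "A young hero must save the world",
           d2 = "The hero returns to a dark world")
stop_words <- c("a", "the", "to", "must")  # illustrative stop-word list

# Lower-case, split on non-letters, drop empty strings and stop words
toks <- lapply(strsplit(tolower(texts), "[^a-z]+"), function(x) {
  x <- x[nzchar(x)]
  x[!x %in% stop_words]
})

# One row per document, one column per remaining word type
vocab <- sort(unique(unlist(toks)))
dfm_toy <- t(sapply(toks, function(x) table(factor(x, levels = vocab))))
dfm_toy
```

Each cell counts how often a word occurs in a document, which is the same layout as the vdgames_dfm output above.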

Relevant keywords

ndoc() and nfeat() give the number of documents and features in the matrix, and topfeatures() lists the most frequent features. The function kwic() then shows selected keywords in context, together with their neighbouring words. For my analysis, the keywords are “quest”, “mysterious” and “planet”. Note that the third search runs on the unfiltered tokens (vdgames_toks), so its context windows still contain stop words.

ndoc(vdgames_dfm)

## [1] 1000

nfeat(vdgames_dfm)

## [1] 248

topfeatures(vdgames_dfm, 30)

##        war      years       time       full mysterious       evil        way 
##         79         55         54         50         49         45         44 
##       back       dark      order      young     events     battle      power 
##         44         43         42         42         41         41         38 
##      group     across    control    journey       life     island     forces 
##         37         35         35         34         33         31         31 
##  adventure       land      takes      quest     planet   powerful     empire 
##         31         30         30         29         29         28         27 
##        now      earth 
##         27         27

keyword1 <- kwic(vdgames_toks2, pattern = phrase("quest*"), window = 2)
head(keyword1, 5)

## Keyword-in-context with 5 matches.                                                     
##     [8, 3] control androids | quest | discover really
##    [27, 7]       Drake goes | quest | Marco Polo's   
##   [31, 14]     warrior goes | quest | learn truth    
##    [59, 6]    Drake embarks | quest | search Atlantis
##  [148, 16]    Tony survives | quest | kill Sosa

keyword2 <- kwic(vdgames_toks2, pattern = phrase("mysterious*"), window = 2)
head(keyword2, 5)

## Keyword-in-context with 5 matches.                                                            
##   [5, 14] mission vengeance | mysterious | militia          
##   [9, 12]    America formed | mysterious | apocalyptic event
##  [31, 17]       learn truth | mysterious | origin state     
##  [48, 14]  ultimately leads | mysterious | village          
##  [72, 14]  GrabPack Explore | mysterious | facility get

keyword3 <- kwic(vdgames_toks, pattern = phrase("planet*"), window = 2)
head(keyword3, 5)

## Keyword-in-context with 5 matches.                                                    
##   [34, 31] protect the |  planet  | and all         
##  [109, 15]  beyond the |  planet  | Pandora and     
##  [146, 17]      in the | planets  | maximum security
##  [169, 15]     rid the |  planet  | of invading     
##   [170, 6]  enters the | planet's | orbit and
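The keyword-in-context idea can be sketched in a few lines of base R. This is a simplified illustration and not how quanteda implements kwic(): the helper kwic_toy is a hypothetical name, it matches exact tokens only (no glob patterns such as "quest*"), and the token vector is a lower-cased version of one of the matches shown above.

```r
# Hypothetical helper: exact-match keyword-in-context on one token vector
kwic_toy <- function(tokens, keyword, window = 2) {
  n <- length(tokens)
  hits <- which(tokens == keyword)
  # Paste the tokens between two positions, clipped to the document bounds
  context <- function(from, to) {
    idx <- seq_len(n)
    paste(tokens[idx >= from & idx <= to], collapse = " ")
  }
  data.frame(
    position = hits,
    pre      = vapply(hits, function(i) context(i - window, i - 1), character(1)),
    keyword  = tokens[hits],
    post     = vapply(hits, function(i) context(i + 1, i + window), character(1)),
    stringsAsFactors = FALSE
  )
}

toks <- c("drake", "embarks", "quest", "search", "atlantis")
kwic_toy(toks, "quest")
```

The window argument plays the same role as window = 2 in the kwic() calls above: it sets how many neighbouring tokens are shown on each side of the match.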

Feature co-occurrence matrix

The feature co-occurrence matrix (FCM) records the number of co-occurrences of tokens. I keep the words that occur at least 35 times; there are 17 such words, and I visualize a semantic network of them using textplot_network.

vdgames_dfm_small <- dfm_trim(vdgames_dfm, min_termfreq = 35)
nfeat(vdgames_dfm_small)

## [1] 17

vdgames_fcm <- fcm(vdgames_dfm_small)

feat <- names(topfeatures(vdgames_fcm, 30))
fcmat_select <- fcm_select(vdgames_fcm, pattern = feat, selection = "keep")


size <- log(colSums(dfm_select(vdgames_fcm, feat, selection = "keep")))
set.seed(123)
textplot_network(fcmat_select, min_freq = 0.5, edge_size = 2, edge_color="darkseagreen",
                 vertex_size = size/max(size)*3)
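As a rough illustration of what a co-occurrence matrix contains, the base R sketch below counts in how many documents each pair of words appears together. The counts in dfm_toy are made up, and this boolean document-level count is a simplification that I am assuming for illustration; quanteda's fcm() offers several counting schemes and may give different numbers.

```r
# Toy document-feature counts (hypothetical values)
dfm_toy <- matrix(c(1, 1, 0,
                    0, 2, 1,
                    1, 0, 1),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("d1", "d2", "d3"),
                                  c("war", "dark", "evil")))

# 1 if the word occurs in the document at all, 0 otherwise
present <- (dfm_toy > 0) * 1

# Entry [i, j] = number of documents that contain both word i and word j
cooc <- crossprod(present)
diag(cooc) <- 0  # ignore a word co-occurring with itself
cooc
```

In the network plot, each non-zero entry of such a matrix becomes an edge between two words.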

LDA Topic Modelling

Latent Dirichlet Allocation (LDA) is used for topic modelling, to organize the above data into themes. The top 5 keywords in each topic are then visualized.

vdgames_dtmat = quanteda::convert(vdgames_dfm, to="topicmodels")
vdgames_lda5 <- LDA(vdgames_dtmat, k = 5, control = list(seed = 123))


vdgames_lda5_betas <- broom::tidy(vdgames_lda5)

top_terms_in_topics <- vdgames_lda5_betas %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
#top_terms_in_topics
kbl(head(top_terms_in_topics, n=10))%>%
  kable_paper(bootstrap_options = "striped", full_width=F)

topic	term	beta
1	mysterious	0.0258181
1	time	0.0255892
1	young	0.0212622
1	battle	0.0175506
1	evil	0.0166833
2	years	0.0238485
2	power	0.0219938
2	dark	0.0218196
2	empire	0.0181477
2	across	0.0169756

Plotting the top keywords in each topic

top_terms_in_topics %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_fill_brewer(palette="YlGn")+
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

test = subset(vdgames_df)
nrow(test)

## [1] 1000

Cross validating LDA Model

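To ensure good performance of this LDA model, the model should be evaluated on unseen data through cross-validation, which helps identify the best number of topics. Note that both corpus_subset() calls below are made without a subset condition, so the training and test matrices each contain all 1,000 documents; the perplexity reported in the next section is therefore measured on the training data rather than on held-out documents.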

train_vdgames_dtmat <- corpus_subset(vdgames_corp) %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE,
         remove_symbols = TRUE, remove_url = TRUE) %>%
  dfm(tolower = TRUE) %>%
  dfm_remove(c(stopwords("en"), myStopWords)) %>%
  dfm_trim(min_termfreq = 5, min_docfreq = 10) %>%
  quanteda::convert(to="topicmodels")

test_vdgames_dtmat <- corpus_subset(vdgames_corp) %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE,
         remove_symbols = TRUE, remove_url = TRUE) %>%
  dfm(tolower = TRUE) %>%
  dfm_remove(c(stopwords("en"), myStopWords)) %>%
  dfm_trim(min_termfreq = 5, min_docfreq = 10) %>%
  quanteda::convert(to="topicmodels")
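Evaluating on unseen data requires a rule that assigns each document to either the training set or the test set. The following is a minimal base R sketch of such a rule, assuming an 80/20 split; the share, the seed and the names test_share, is_test and is_train are my own illustrative choices and not part of the original analysis.

```r
set.seed(123)

n_docs <- 1000      # number of documents in the corpus
test_share <- 0.2   # assumed share of held-out documents

# Draw the held-out document positions without replacement
test_idx <- sample(n_docs, size = round(test_share * n_docs))

# Logical flags, one per document, that partition the corpus
is_test  <- seq_len(n_docs) %in% test_idx
is_train <- !is_test

table(is_test)
```

These flags could then be supplied as the subset condition in the two corpus_subset() calls. This sketch covers only the split itself, not the remaining pre-processing.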

Testing the Perplexity

Perplexity is a common metric for evaluating language models; a lower perplexity indicates a better fit. Here, I first test the perplexity for k = 5 and then plot the perplexity for k = 2 to 5.

train_vdgames_lda5 <- LDA(train_vdgames_dtmat, k = 5, control = list(seed = 123))
perplexity(train_vdgames_lda5, test_vdgames_dtmat)

## [1] 223.7036

n_topics_vec = 2:5
perplexity_vec = map_dbl(n_topics_vec, function(kk) {
  message(kk)
  train_vdgames_ldaK <- LDA(train_vdgames_dtmat, k = kk, control = list(seed = 123))
  perp = perplexity(train_vdgames_ldaK, test_vdgames_dtmat)
})

## 2

## 3

## 4

## 5

lda_perplexity_result = tibble(
  n_topics = n_topics_vec, perplexity = perplexity_vec
)
plot(lda_perplexity_result, type="l")
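Perplexity is the exponential of the negative average log-likelihood per token, so it can be read as the effective number of equally likely words the model is choosing between. The sketch below shows the arithmetic on made-up per-token probabilities. The function name perplexity_toy is hypothetical, and in the analysis above the probabilities come from the fitted LDA model rather than being supplied directly.

```r
# Perplexity from a vector of per-token probabilities assigned by a model
perplexity_toy <- function(token_probs) {
  exp(-mean(log(token_probs)))
}

# A model that spreads probability uniformly over 4 words has perplexity 4
uniform_pp <- perplexity_toy(rep(1 / 4, 10))
uniform_pp

# Assigning higher probability to the observed tokens lowers the perplexity
sharper_pp <- perplexity_toy(c(0.5, 0.5, 0.25, 0.25))
sharper_pp
```

This is why a lower value in the plot above indicates a better-fitting number of topics.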

Finding best number of topics using ldatuning

Using ldatuning to find the best number of topics based on the “CaoJuan2009”,“Arun2010”, and “Deveaud2014” measures as per requirement.

lda_ldatuning_result <- FindTopicsNumber(
  vdgames_dtmat, topics = n_topics_vec,
  metrics = c("CaoJuan2009", "Arun2010", "Deveaud2014"),  method = "VEM", 
  control = list(seed = 123), mc.cores = 4L, verbose = TRUE)

## fit models... done.
## calculate metrics:
##   CaoJuan2009... done.
##   Arun2010... done.
##   Deveaud2014... done.

FindTopicsNumber_plot(lda_ldatuning_result)

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

Each metric votes for a value of k. CaoJuan2009 and Arun2010 are minimizers, so lower values are better: Arun2010 is lowest at k=5 and CaoJuan2009 is lowest at k=2. Deveaud2014 is a maximizer, so higher values are better, and its highest value occurs at k=5. The graph above therefore suggests that the 5-topic LDA performs best. Note, however, that the model fitted below uses k = 3; its topic-specific diagnostics are shown using the topicdoc package.
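The voting logic can be written out in base R. The metric values in metrics_toy are hypothetical numbers chosen only to reproduce the pattern described above (they are not the values from lda_ldatuning_result), and the column layout is assumed to resemble the data frame returned by FindTopicsNumber().

```r
# Hypothetical metric values for k = 2..5 (not the real results)
metrics_toy <- data.frame(
  topics      = 2:5,
  CaoJuan2009 = c(0.10, 0.14, 0.13, 0.12),  # minimizer
  Arun2010    = c(9.0, 8.1, 7.4, 6.8),      # minimizer
  Deveaud2014 = c(1.20, 1.30, 1.35, 1.50)   # maximizer
)

# Each metric votes for the k at which it is best
best_k <- c(
  CaoJuan2009 = metrics_toy$topics[which.min(metrics_toy$CaoJuan2009)],
  Arun2010    = metrics_toy$topics[which.min(metrics_toy$Arun2010)],
  Deveaud2014 = metrics_toy$topics[which.max(metrics_toy$Deveaud2014)]
)

# The k with the most votes wins
votes <- table(best_k)
winner <- as.integer(names(votes)[which.max(votes)])
winner
```

With these illustrative numbers, two of the three metrics pick k = 5, matching the reading of the graph described above.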

vdgames_lda3 <- LDA(vdgames_dtmat, k = 3, control = list(seed = 123))
topicdoc_result = topic_diagnostics(vdgames_lda3, vdgames_dtmat)
#view(topicdoc_result)
kbl(head(topicdoc_result, n=10))%>%
  kable_paper(bootstrap_options = "striped", full_width=F)

topic_num	topic_size	mean_token_length	dist_from_corpus	tf_df_dist	doc_prominence	topic_coherence	topic_exclusivity
1	82.55511	5.3	0.2354626	0.5653146	985	-210.2222	9.350212
2	85.13868	5.5	0.2406158	0.3303270	985	-216.2282	8.981297
3	80.30621	5.2	0.2299487	0.3914156	985	-217.5405	9.398013

Fitting a Structural Topic Model

A structural topic model (STM) can improve inference and qualitative interpretability by letting document metadata affect topical prevalence, topic content, or both. The model below is fitted to the video games data without covariates, and the result is explored with the LDAvis package.

stm_vdgamesdfmat <- quanteda::convert(vdgames_dfm, to = "stm")

## Warning in dfm2stm(x, docvars, omit_empty = TRUE): Dropped empty document(s):
## 23, 213, 226, 310, 376, 439, 454, 635, 731, 780, 787, 825, 854, 935, 959

out <- prepDocuments( stm_vdgamesdfmat$documents, 
                      stm_vdgamesdfmat$vocab, 
                      stm_vdgamesdfmat$meta)

vdgames_tmob_stm <- stm(out$documents, out$vocab,K=3,
                         seed=123,emtol=1e-3, max.em.its=150)

## Beginning Spectral Initialization 
##   Calculating the gram matrix...
##   Finding anchor words...
##      ...
##   Recovering initialization...
##      ..
## Initialization complete.
## .............................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 1 (approx. per word bound = -5.661) 
## .............................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 2 (approx. per word bound = -5.518, relative change = 2.539e-02) 
## .............................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 3 (approx. per word bound = -5.466, relative change = 9.440e-03) 
## .............................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 4 (approx. per word bound = -5.439, relative change = 4.850e-03) 
## .............................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 5 (approx. per word bound = -5.424, relative change = 2.723e-03) 
## Topic 1: years, way, back, dark, battle 
##  Topic 2: young, group, journey, island, power 
##  Topic 3: war, time, evil, events, life 
## .............................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 6 (approx. per word bound = -5.415, relative change = 1.658e-03) 
## .............................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 7 (approx. per word bound = -5.409, relative change = 1.080e-03) 
## .............................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Model Converged

toLDAvis(mod=vdgames_tmob_stm, docs=out$documents)

## Loading required namespace: servr

plot(vdgames_tmob_stm, type="summary", n=5)

Comparing topic quality

topicQuality(vdgames_tmob_stm, out$documents)

## [1] -172.9114 -228.3673 -198.8188
## [1] 8.885934 8.680275 8.422684

keyATM_docs <- keyATM_read(texts = vdgames_dfm)

## Using quanteda dfm.

## Warning in get_doc_index(W_raw, check = TRUE): Number of documents with 0 length: 15
## This may cause invalid covariates or time index.
## Please review the preprocessing steps.
## Document index to check: 24, 214, 227, 311, 377, 440, 455, 636, 732, 781, 788, 826, 855, 936, 960

summary(keyATM_docs)

## keyATM_docs object of: 1000 documents.
## Length of documents:
##   Avg: 4.363
##   Min: 0
##   Max: 12
##    SD: 2.322
## Number of unique words: 2123

Comparing topic quality for the STM, Topic 2 has medium exclusivity (8.68) but the lowest semantic coherence (-228.37) of the three topics. In the keyword-assisted model summarized below, the war topic (2_war) is associated with words like “war”, “battle”, “army” and “military”, which suggests that many popular video games are action-based, with plots centered on fighting, combat, battles and similar themes.

Top 5 keywords
1_dark	2_war	3_mysterious	Other_1	Other_2	Other_3
young [✓]	war	time [✓]	duty	add	marvel
land [✓]	years [✓]	back [✓]	call	york	universe
power [✓]	mysterious	evil [✓]	star	rise	heroes
king	full	control [✓]	wars	century	villains
journey [✓]	way [✓]	mario	battle [2]	gang	order [2]

Conclusion

This project shows how topic modelling can be used to interpret descriptions of video games, their plots and the themes that are most popular. A similar exploratory text analysis could also be applied to other products.