Assignment2-Misra-VideoGamesPlots

Shrawani Misra

10/09/2022

Plots of Latest Popular Video Games: Text Data Analysis

This project analyzes the plots and categories of recent video games that are popular on IMDB. The dataset contains 20000 video game titles with their corresponding genres, certificate information, IMDB vote count, and plot summary.

The genres are Action, Adventure, Comedy, Crime, Family, Fantasy, Mystery, Sci-fi, and Thriller. The plot summary outlines how the story of the game progresses and is an important factor in the popularity of a video game. Analyzing the text of these plots can give insight into trends in genre popularity, and can help answer questions that are useful when writing catchy game titles or plots for future sales.

Project Coding

Loading the libraries required in the project

library(ldatuning)
## Warning: package 'ldatuning' was built under R version 4.1.3
library(stm)
## Warning: package 'stm' was built under R version 4.1.3
## stm v1.3.6 successfully loaded. See ?stm for help. 
##  Papers, resources, and other materials at structuraltopicmodel.com
library(keyATM)
## Warning: package 'keyATM' was built under R version 4.1.3
## keyATM 0.4.1 successfully loaded.
##  Papers, examples, resources, and other materials are at
##  https://keyatm.github.io/keyATM/
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.1.3
library(topicmodels)
## Warning: package 'topicmodels' was built under R version 4.1.3
library(quanteda)
## Warning: package 'quanteda' was built under R version 4.1.3
## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "packedMatrix" of class "mMatrix"; definition not updated
## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "packedMatrix" of class "replValueSp"; definition not updated
## Package version: 3.2.3
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.9
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.3
## Warning: package 'purrr' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter()     masks stats::filter()
## x dplyr::group_rows() masks kableExtra::group_rows()
## x dplyr::lag()        masks stats::lag()
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.1.3
library(seededlda)
## Warning: package 'seededlda' was built under R version 4.1.3
## Loading required package: proxyC
## Warning: package 'proxyC' was built under R version 4.1.3
## 
## Attaching package: 'proxyC'
## 
## The following object is masked from 'package:stats':
## 
##     dist
## 
## 
## Attaching package: 'seededlda'
## 
## The following objects are masked from 'package:topicmodels':
## 
##     terms, topics
## 
## The following object is masked from 'package:stats':
## 
##     terms
library(topicdoc)
## Warning: package 'topicdoc' was built under R version 4.1.3
library(LDAvis)
## Warning: package 'LDAvis' was built under R version 4.1.3
library(quanteda.textplots)
## Warning: package 'quanteda.textplots' was built under R version 4.1.3
library(broom)
## Warning: package 'broom' was built under R version 4.1.3
library(ggplot2)

Reading the text from the data

The data is limited to the first 1000 rows, as per the stated requirements, and the result is written to a CSV file. The text is then read with the readtext function, which extracts the document-level variables directly from the data. The columns chosen for this project are “plot”, which holds the plot summary of each video game, and “X”, a unique identifier used as the docid field.

setwd("C:/CIS8045/data")
vdgames_df = read.csv("imdb-videogames.csv",nrows = 1000)

write.csv(vdgames_df,"C:/CIS8045/data/Assignment2.csv", row.names = FALSE)

vdgames_df <- readtext::readtext("Assignment2.csv", 
                                 text_field = "plot", docid_field = "X")
#glimpse(vdgames_df)
kbl(head(vdgames_df, n=15))%>%
  kable_paper(bootstrap_options = "striped", full_width=F)
| doc_id | text | name | url | year | certificate | rating | votes | Action | Adventure | Comedy | Crime | Family | Fantasy | Mystery | Sci.Fi | Thriller |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | When a new villain threatens New York City, Peter Parker and Spider-Man’s worlds collide. To save the city and those he loves, he must rise up and be greater. | Spider-Man | https://www.imdb.com/title/tt5807780/?ref_=adv_li_tt | 2018 | T | 9.2 | 20,759 | TRUE | TRUE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE |
| 1 | Amidst the decline of the Wild West at the turn of the 20th century, outlaw Arthur Morgan and his gang struggle to cope with the loss of their way of life. | Red Dead Redemption II | https://www.imdb.com/title/tt6161168/?ref_=adv_li_tt | 2018 | M | 9.7 | 35,703 | TRUE | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 2 | Three very different criminals team up for a series of heists and walk into some of the most thrilling experiences in the corrupt city of Los Santos. | Grand Theft Auto V | https://www.imdb.com/title/tt2103188/?ref_=adv_li_tt | 2013 | M | 9.5 | 59,986 | TRUE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 3 | After wiping out the gods of Mount Olympus, Kratos moves on to the frigid lands of Scandinavia, where he and his son must embark on an odyssey across a dangerous world of gods and monsters. | God of War | https://www.imdb.com/title/tt5838588/?ref_=adv_li_tt | 2018 | M | 9.6 | 26,118 | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 4 | Thrown back into the dangerous underworld he’d tried to leave behind, Nathan Drake must decide what he’s willing to sacrifice to save the ones he loves. | Uncharted 4: A Thief’s End | https://www.imdb.com/title/tt3334704/?ref_=adv_li_tt | 2016 | T | 9.5 | 28,722 | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 5 | Five years after the events of The Last of Us, Ellie embarks on another journey through a post-apocalyptic America on a mission of vengeance against a mysterious militia. | The Last of Us: Part II | https://www.imdb.com/title/tt6298000/?ref_=adv_li_tt | 2020 | M | 8.5 | 30,460 | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 6 | Aloy treks into an arcane region and faces new hostile enemies and threats in search of a way to heal the world from a deadly blight and catastrophic storms. | Horizon Forbidden West | https://www.imdb.com/title/tt12496904/?ref_=adv_li_tt | 2022 | T | 9.2 | 2,979 | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE |
| 7 | In a hostile, post-pandemic world, Joel and Ellie, brought together by desperate circumstances, must rely on each other to survive a brutal journey across what remains of the United States. | The Last of Us | https://www.imdb.com/title/tt2140553/?ref_=adv_li_tt | 2013 | M | 9.7 | 60,590 | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 8 | Take control of three androids in their quest to discover who they really are. | Detroit: Become Human | https://www.imdb.com/title/tt5158314/?ref_=adv_li_tt | 2018 | M | 9.2 | 16,907 | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 9 | Deliveryman Sam Porter must travel across a ravaged wasteland and reconnect the city states of America formed after a mysterious apocalyptic event dubbed ‘death stranding’ left the world in ruins and plagued by supernatural tar creatures. | Death Stranding | https://www.imdb.com/title/tt5807606/?ref_=adv_li_tt | 2019 | M | 8.8 | 8,136 | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 10 | Set in 1274 on the Tsushima Island, the last samurai, Jin Sakai, must master a new fighting style, the way of the Ghost, to defeat the Mongol forces and fight for the freedom and independence of Japan. | Ghost of Tsushima | https://www.imdb.com/title/tt7651352/?ref_=adv_li_tt | 2020 | M | 9.3 | 8,452 | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 11 | In this sequel of Marvel’s Spider-Man (2018), you can play as Miles Morales as a new and different Spider-Man while he learns some stories about his will of fighting crime and serving justice by his mentor and former hero, Peter Parker. | Spider-Man: Miles Morales | https://www.imdb.com/title/tt12496734/?ref_=adv_li_tt | 2020 | T | 8.5 | 5,835 | TRUE | TRUE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE |
| 12 | In Night City, a mercenary known as V navigates a dystopian society in which the line between humanity and technology becomes blurred. | Cyberpunk 2077 | https://www.imdb.com/title/tt3810192/?ref_=adv_li_tt | 2020 | M | 8.0 | 8,118 | TRUE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 13 | Treasure hunter Nathan Drake, embarks in the adventure of his life searching for the legendary treasure, El Dorado while fighting a group of mercenaries. | Uncharted: Drake’s Fortune | https://www.imdb.com/title/tt1000777/?ref_=adv_li_tt | 2007 | T | 8.5 | 20,343 | TRUE | TRUE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE |
| 14 | With his back against the wall, Batman turns to his closest allies to help him save Gotham City from the clutches of Scarecrow and the Arkham Knight’s army. A familiar face also returns to give The Dark Knight a message he cannot ignore. | Batman: Arkham Knight | https://www.imdb.com/title/tt3554580/?ref_=adv_li_tt | 2015 | M | 9.0 | 18,970 | TRUE | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |

Tokens and Feature Matrix

The quanteda package is used to create a corpus from the above data, followed by tokens and a document-feature matrix. The resulting corpus is a static container of the texts from the video games data. The corpus itself is not cleaned or pre-processed (for example by stemming). Instead, the corpus vdgames_corp serves as a reference copy from which new objects are extracted for the required analysis.

To segment the text from the corpus, I tokenize it and strip punctuation, numbers, symbols and URLs. I also define a vector of custom stop words that recur in the data but are not relevant to the final result. vdgames_dfm is the resulting document-feature matrix, trimmed to features that occur at least 5 times and in at least 10 documents.

vdgames_corp <- corpus(vdgames_df)
summary(vdgames_corp, n = 5)
## Corpus consisting of 1000 documents, showing 5 documents:
## 
##  Text Types Tokens Sentences                       name
##     0    28     33         2                 Spider-Man
##     1    26     33         1     Red Dead Redemption II
##     2    25     28         1         Grand Theft Auto V
##     3    31     38         1                 God of War
##     4    25     28         1 Uncharted 4: A Thief's End
##                                                   url year certificate rating
##  https://www.imdb.com/title/tt5807780/?ref_=adv_li_tt 2018           T    9.2
##  https://www.imdb.com/title/tt6161168/?ref_=adv_li_tt 2018           M    9.7
##  https://www.imdb.com/title/tt2103188/?ref_=adv_li_tt 2013           M    9.5
##  https://www.imdb.com/title/tt5838588/?ref_=adv_li_tt 2018           M    9.6
##  https://www.imdb.com/title/tt3334704/?ref_=adv_li_tt 2016           T    9.5
##   votes Action Adventure Comedy Crime Family Fantasy Mystery Sci.Fi Thriller
##  20,759   TRUE      TRUE  FALSE FALSE  FALSE    TRUE   FALSE  FALSE    FALSE
##  35,703   TRUE      TRUE  FALSE  TRUE  FALSE   FALSE   FALSE  FALSE    FALSE
##  59,986   TRUE     FALSE  FALSE  TRUE  FALSE   FALSE   FALSE  FALSE    FALSE
##  26,118   TRUE      TRUE  FALSE FALSE  FALSE   FALSE   FALSE  FALSE    FALSE
##  28,722   TRUE      TRUE  FALSE FALSE  FALSE   FALSE   FALSE  FALSE    FALSE
vdgames_toks <- tokens(
  vdgames_corp,
  remove_punct = TRUE,
  remove_numbers = TRUE,
  remove_symbols = TRUE,
  remove_url = TRUE,
  split_hyphens = FALSE)


myStopWords = c("he", "when", "and", "in", "the", "by", "to", "his", "is", "must",
                "over", "out", "during", "of", "can", "through", "so", "where", "set",
                "based", "than", "this", "new", "city", "fight", "save",
                "three", "world", "find", "take", "known", "one", "help", "called",
                "summary", "story", "first", "stop", "worlds", "game", "player", "play",
                "team", "see", "plot", "players", "follows", "characters", "two", "â",
                "friends", "series", "video")

vdgames_toks2 <- tokens_remove(
  vdgames_toks, pattern = c(stopwords("en"), myStopWords))

vdgames_dfm <- dfm(vdgames_toks2, tolower = TRUE) %>%
  dfm_remove(c(stopwords("en"), myStopWords)) %>%
  dfm_trim(min_termfreq = 5, min_docfreq = 10)

vdgames_dfm 
## Document-feature matrix of: 1,000 documents, 248 features (98.30% sparse) and 15 docvars.
##     features
## docs york rise century gang way life different criminals lands son
##    0    1    1       0    0   0    0         0         0     0   0
##    1    0    0       1    1   1    1         0         0     0   0
##    2    0    0       0    0   0    0         1         1     0   0
##    3    0    0       0    0   0    0         0         0     1   1
##    4    0    0       0    0   0    0         0         0     0   0
##    5    0    0       0    0   0    0         0         0     0   0
## [ reached max_ndoc ... 994 more documents, reached max_nfeat ... 238 more features ]
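
The effect of the stop-word removal and trimming steps above can be checked on a small example. The following is a minimal sketch using three made-up plot sentences (toy_txt and the other toy_ names are hypothetical, not part of the project data), with lower thresholds than the min_termfreq = 5 and min_docfreq = 10 used above.

```r
library(quanteda)

# Hypothetical toy plots, not taken from the IMDB data
toy_txt <- c(d1 = "A young hero must save the kingdom.",
             d2 = "The hero returns to the dark kingdom!",
             d3 = "A dark war begins.")

# Tokenize, drop punctuation, then remove English stop words plus one custom word
toy_toks <- tokens(corpus(toy_txt), remove_punct = TRUE)
toy_toks <- tokens_remove(toy_toks, pattern = c(stopwords("en"), "must"))

# Build the document-feature matrix and keep only features that occur
# at least twice overall and in at least two documents
toy_dfm   <- dfm(toy_toks, tolower = TRUE)
toy_small <- dfm_trim(toy_dfm, min_termfreq = 2, min_docfreq = 2)

featnames(toy_small)
```

Words that appear only once, such as "young" or "war", are dropped by the trim, which is the same mechanism that reduces the project matrix to 248 features.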

Relevant keywords

The size of the matrix is checked with ndoc() and nfeat(), and the most frequent features are listed with topfeatures(). Some relevant keywords are then analyzed with the function kwic(), which shows each keyword in context with its neighbouring words. For my analysis, the keywords are “quest”, “mysterious”, and “planet”.

ndoc(vdgames_dfm)
## [1] 1000
nfeat(vdgames_dfm)
## [1] 248
topfeatures(vdgames_dfm, 30)
##        war      years       time       full mysterious       evil        way 
##         79         55         54         50         49         45         44 
##       back       dark      order      young     events     battle      power 
##         44         43         42         42         41         41         38 
##      group     across    control    journey       life     island     forces 
##         37         35         35         34         33         31         31 
##  adventure       land      takes      quest     planet   powerful     empire 
##         31         30         30         29         29         28         27 
##        now      earth 
##         27         27
keyword1 <- kwic(vdgames_toks2, pattern = phrase("quest*"), window = 2)
head(keyword1, 5)
## Keyword-in-context with 5 matches.                                                     
##     [8, 3] control androids | quest | discover really
##    [27, 7]       Drake goes | quest | Marco Polo's   
##   [31, 14]     warrior goes | quest | learn truth    
##    [59, 6]    Drake embarks | quest | search Atlantis
##  [148, 16]    Tony survives | quest | kill Sosa
keyword2 <- kwic(vdgames_toks2, pattern = phrase("mysterious*"), window = 2)
head(keyword2, 5)
## Keyword-in-context with 5 matches.                                                            
##   [5, 14] mission vengeance | mysterious | militia          
##   [9, 12]    America formed | mysterious | apocalyptic event
##  [31, 17]       learn truth | mysterious | origin state     
##  [48, 14]  ultimately leads | mysterious | village          
##  [72, 14]  GrabPack Explore | mysterious | facility get
keyword3 <- kwic(vdgames_toks, pattern = phrase("planet*"), window = 2)
head(keyword3, 5)
## Keyword-in-context with 5 matches.                                                    
##   [34, 31] protect the |  planet  | and all         
##  [109, 15]  beyond the |  planet  | Pandora and     
##  [146, 17]      in the | planets  | maximum security
##  [169, 15]     rid the |  planet  | of invading     
##   [170, 6]  enters the | planet's | orbit and

Feature co-occurrence matrix

The feature co-occurrence matrix (FCM) records the number of co-occurrences of tokens. I keep the words that occur at least 35 times in the data. There are 17 such words, and I visualize a semantic network of them using textplot_network.

vdgames_dfm_small <- dfm_trim(vdgames_dfm, min_termfreq = 35)
nfeat(vdgames_dfm_small)
## [1] 17
vdgames_fcm <- fcm(vdgames_dfm_small)

feat <- names(topfeatures(vdgames_fcm, 30))
fcmat_select <- fcm_select(vdgames_fcm, pattern = feat, selection = "keep")


size <- log(colSums(dfm_select(vdgames_fcm, feat, selection = "keep")))
set.seed(123)
textplot_network(fcmat_select, min_freq = 0.5, edge_size = 2, edge_color="darkseagreen",
                 vertex_size = size/max(size)*3)
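
What the FCM counts can be seen on a tiny example. This is a minimal sketch with made-up texts (the toy_ objects are hypothetical), assuming the default document-level context of fcm() when it is given a dfm.

```r
library(quanteda)

# Hypothetical toy plots (not from the IMDB data)
toy_txt <- c(d1 = "hero kingdom war",
             d2 = "hero kingdom",
             d3 = "war dark")

toy_dfm <- dfm(tokens(toy_txt))

# Count how often two features appear in the same document
toy_fcm <- fcm(toy_dfm)
toy_mat <- as.matrix(toy_fcm)

# "hero" and "kingdom" occur together in d1 and d2
toy_mat["hero", "kingdom"]
```

In the network plot, pairs with higher counts like this one are the pairs connected by an edge.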

LDA Topic Modelling

Latent Dirichlet Allocation (LDA) is used for topic modelling to organize the above data into themes. A 5-topic model is fitted, and the top 5 keywords in each topic are extracted. The table shows the first 10 rows, covering topics 1 and 2.

vdgames_dtmat = quanteda::convert(vdgames_dfm, to="topicmodels")
vdgames_lda5 <- LDA(vdgames_dtmat, k = 5, control = list(seed = 123))


vdgames_lda5_betas <- broom::tidy(vdgames_lda5)

top_terms_in_topics <- vdgames_lda5_betas %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
#top_terms_in_topics
kbl(head(top_terms_in_topics, n=10))%>%
  kable_paper(bootstrap_options = "striped", full_width=F)
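| topic | term | beta |
|---|---|---|
| 1 | mysterious | 0.0258181 |
| 1 | time | 0.0255892 |
| 1 | young | 0.0212622 |
| 1 | battle | 0.0175506 |
| 1 | evil | 0.0166833 |
| 2 | years | 0.0238485 |
| 2 | power | 0.0219938 |
| 2 | dark | 0.0218196 |
| 2 | empire | 0.0181477 |
| 2 | across | 0.0169756 |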

Plotting the top keywords in a topic

top_terms_in_topics %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_fill_brewer(palette="YlGn")+
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

test = subset(vdgames_df)
nrow(test)
## [1] 1000

Cross validating LDA Model

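To check the performance of this LDA model, it is better to evaluate it on unseen data through cross-validation, which also helps identify the best number of topics. Note that corpus_subset() is called without a subset condition below, so the training and test matrices both contain all 1000 documents, and the perplexity reported in the next section is measured on the training data itself.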

train_vdgames_dtmat <- corpus_subset(vdgames_corp) %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE,
         remove_symbols = TRUE, remove_url = TRUE) %>%
  dfm(tolower = TRUE) %>%
  dfm_remove(c(stopwords("en"), myStopWords)) %>%
  dfm_trim(min_termfreq = 5, min_docfreq = 10) %>%
  quanteda::convert(to="topicmodels")

test_vdgames_dtmat <- corpus_subset(vdgames_corp) %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE,
         remove_symbols = TRUE, remove_url = TRUE) %>%
  dfm(tolower = TRUE) %>%
  dfm_remove(c(stopwords("en"), myStopWords)) %>%
  dfm_trim(min_termfreq = 5, min_docfreq = 10) %>%
  quanteda::convert(to="topicmodels")
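
A held-out evaluation needs disjoint training and test documents. The following is a minimal sketch of an 80/20 split over document indices in base R. The 80/20 ratio and the names n_docs, train_idx, test_idx and is_train are assumptions for illustration, and the commented lines only indicate how the logical vector might be passed to corpus_subset().

```r
# Sketch of a holdout split over document positions (base R only).
set.seed(123)

n_docs <- 1000                                    # documents in the corpus above
train_idx <- sort(sample(n_docs, size = round(0.8 * n_docs)))
test_idx  <- setdiff(seq_len(n_docs), train_idx)  # everything not in training

# Logical vector marking the training documents
is_train <- seq_len(n_docs) %in% train_idx

# Assumed usage with the objects from this report:
#   train_corp <- corpus_subset(vdgames_corp, is_train)
#   test_corp  <- corpus_subset(vdgames_corp, !is_train)
# The test dfm would likely also need the training vocabulary, e.g. via
# dfm_match(test_dfm, featnames(train_dfm)), before calling perplexity().

c(train = length(train_idx), test = length(test_idx))
```

With a split like this, the perplexity on the test matrix reflects documents the model has not seen during fitting.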

Testing the Perplexity

Perplexity is a common metric for evaluating NLP and language models; a lower perplexity indicates a better fit. Here, I test the perplexity for k=5 and then plot the perplexity for k = 2 to 5.

train_vdgames_lda5 <- LDA(train_vdgames_dtmat, k = 5, control = list(seed = 123))
perplexity(train_vdgames_lda5, test_vdgames_dtmat)
## [1] 223.7036
n_topics_vec = 2:5
perplexity_vec = map_dbl(n_topics_vec, function(kk) {
  message(kk)
  train_vdgames_ldaK <- LDA(train_vdgames_dtmat, k = kk, control = list(seed = 123))
  perplexity(train_vdgames_ldaK, test_vdgames_dtmat)
})
## 2
## 3
## 4
## 5
lda_perplexity_result = tibble(
  n_topics = n_topics_vec, perplexity = perplexity_vec
)
plot(lda_perplexity_result, type="l")
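
For reference, perplexity is the exponential of the negative average log-likelihood per token, so lower values mean the model assigns higher probability to the documents. A minimal sketch with made-up numbers (loglik and n_tokens are hypothetical values, not results from the models above):

```r
# Hypothetical per-document log-likelihoods and token counts
loglik   <- c(-12.0, -20.5, -9.5)
n_tokens <- c(3, 5, 2)

# Perplexity: exp of the negative log-likelihood averaged over all tokens
toy_perplexity <- exp(-sum(loglik) / sum(n_tokens))
toy_perplexity
```

A model that assigned higher likelihood to the same 10 tokens would give a smaller value.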

Finding best number of topics using ldatuning

Using ldatuning to find the best number of topics based on the “CaoJuan2009”,“Arun2010”, and “Deveaud2014” measures as per requirement.

lda_ldatuning_result <- FindTopicsNumber(
  vdgames_dtmat, topics = n_topics_vec,
  metrics = c("CaoJuan2009", "Arun2010", "Deveaud2014"),  method = "VEM", 
  control = list(seed = 123), mc.cores = 4L, verbose = TRUE)
## fit models... done.
## calculate metrics:
##   CaoJuan2009... done.
##   Arun2010... done.
##   Deveaud2014... done.
FindTopicsNumber_plot(lda_ldatuning_result)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

The best k is chosen by a vote across the metrics. CaoJuan2009 and Arun2010 are minimizers, so lower values are better. Arun2010 has a low value at k=5 and CaoJuan2009 has a low value at k=2. Deveaud2014 is a maximizer, so a higher value is better, and its highest value occurs at k=5. Thus, two of the three metrics in the above graph favor the 5-topic LDA. The code below fits a 3-topic LDA model, the same number of topics as the STM fitted later, and shows topic-specific diagnostics using the topicdoc package.

vdgames_lda3 <- LDA(vdgames_dtmat, k = 3, control = list(seed = 123))
topicdoc_result = topic_diagnostics(vdgames_lda3, vdgames_dtmat)
#view(topicdoc_result)
kbl(head(topicdoc_result, n=10))%>%
  kable_paper(bootstrap_options = "striped", full_width=F)
| topic_num | topic_size | mean_token_length | dist_from_corpus | tf_df_dist | doc_prominence | topic_coherence | topic_exclusivity |
|---|---|---|---|---|---|---|---|
| 1 | 82.55511 | 5.3 | 0.2354626 | 0.5653146 | 985 | -210.2222 | 9.350212 |
| 2 | 85.13868 | 5.5 | 0.2406158 | 0.3303270 | 985 | -216.2282 | 8.981297 |
| 3 | 80.30621 | 5.2 | 0.2299487 | 0.3914156 | 985 | -217.5405 | 9.398013 |
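
The reading of the ldatuning plot described above amounts to picking the k that minimizes CaoJuan2009 and Arun2010 and maximizes Deveaud2014, then taking a vote. A minimal sketch in base R follows. The metric values in toy_result are invented for illustration (they are not the values behind the plot), and the column layout is assumed to mirror the data frame returned by FindTopicsNumber().

```r
# Hypothetical metric values for k = 2..5, invented for illustration only
toy_result <- data.frame(
  topics      = 2:5,
  CaoJuan2009 = c(0.10, 0.14, 0.13, 0.12),  # minimizer
  Arun2010    = c(9.0, 8.2, 7.9, 7.1),      # minimizer
  Deveaud2014 = c(0.80, 0.85, 0.88, 0.93)   # maximizer
)

# Best k according to each metric
best_k <- c(
  CaoJuan2009 = toy_result$topics[which.min(toy_result$CaoJuan2009)],
  Arun2010    = toy_result$topics[which.min(toy_result$Arun2010)],
  Deveaud2014 = toy_result$topics[which.max(toy_result$Deveaud2014)]
)

# Majority vote across the three metrics
votes   <- table(best_k)
voted_k <- as.integer(names(votes)[which.max(votes)])
voted_k
```

When the metrics disagree, as CaoJuan2009 does here, the vote is only a heuristic and the candidate models are still worth inspecting by hand.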

Fitting a Structural Topic Model

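A Structural Topic Model (STM) can improve inference and qualitative interpretability by letting document-level metadata affect topical prevalence, topic content, or both. Here, a 3-topic STM is fitted to the above video games data without covariates, and the result is explored with the LDAvis package.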

stm_vdgamesdfmat <- quanteda::convert(vdgames_dfm, to = "stm")
## Warning in dfm2stm(x, docvars, omit_empty = TRUE): Dropped empty document(s):
## 23, 213, 226, 310, 376, 439, 454, 635, 731, 780, 787, 825, 854, 935, 959
out <- prepDocuments( stm_vdgamesdfmat$documents, 
                      stm_vdgamesdfmat$vocab, 
                      stm_vdgamesdfmat$meta)

vdgames_tmob_stm <- stm(out$documents, out$vocab,K=3,
                         seed=123,emtol=1e-3, max.em.its=150)
## Beginning Spectral Initialization 
##   Calculating the gram matrix...
##   Finding anchor words...
##      ...
##   Recovering initialization...
##      ..
## Initialization complete.
## .............................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 1 (approx. per word bound = -5.661) 
## .............................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 2 (approx. per word bound = -5.518, relative change = 2.539e-02) 
## .............................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 3 (approx. per word bound = -5.466, relative change = 9.440e-03) 
## .............................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 4 (approx. per word bound = -5.439, relative change = 4.850e-03) 
## .............................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 5 (approx. per word bound = -5.424, relative change = 2.723e-03) 
## Topic 1: years, way, back, dark, battle 
##  Topic 2: young, group, journey, island, power 
##  Topic 3: war, time, evil, events, life 
## .............................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 6 (approx. per word bound = -5.415, relative change = 1.658e-03) 
## .............................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 7 (approx. per word bound = -5.409, relative change = 1.080e-03) 
## .............................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Model Converged
toLDAvis(mod=vdgames_tmob_stm, docs=out$documents)
## Loading required namespace: servr
plot(vdgames_tmob_stm, type="summary", n=5)

Comparing topic quality

topicQuality(vdgames_tmob_stm, out$documents)
## [1] -172.9114 -228.3673 -198.8188
## [1] 8.885934 8.680275 8.422684

keyATM_docs <- keyATM_read(texts = vdgames_dfm)
## Using quanteda dfm.
## Warning in get_doc_index(W_raw, check = TRUE): Number of documents with 0 length: 15
## This may cause invalid covariates or time index.
## Please review the preprocessing steps.
## Document index to check: 24, 214, 227, 311, 377, 440, 455, 636, 732, 781, 788, 826, 855, 936, 960
summary(keyATM_docs)
## keyATM_docs object of: 1000 documents.
## Length of documents:
##   Avg: 4.363
##   Min: 0
##   Max: 12
##    SD: 2.322
## Number of unique words: 2123

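Comparing topic qualities from the output above (semantic coherence on the first line, exclusivity on the second), Topic 2 has medium exclusivity (8.68) but the lowest semantic coherence (-228.37), while Topic 1 scores highest on both. Topic 2 has words like “war”, “battle”, “army”, “military”, which suggests that the most popular video games are Action-based, with plots centered around fighting, combat, battles and similar themes.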

Top key words from topics

Keywords associated with 3 topics are visualized, and a keyATM base model is fitted with these 3 keyword topics plus 3 topics without keywords. The top 5 words of each topic are then shown.

vdgames_key_list = list(
  dark = c("young", "power", "journey", "land", "earth"),
  war = c("years", "order", "battle", "way", "group"),
  mysterious = c("time", "evil", "back", "control", "island")
)
vdgames_key_viz <- visualize_keywords(docs = keyATM_docs, keywords = vdgames_key_list)
vdgames_key_viz

vdgames_tmod_keyatm_base <- keyATM(
  docs = keyATM_docs, 
  no_keyword_topics = 3, 
  keywords = vdgames_key_list, 
  model = "base", 
  options = list(seed = 123))
## Warning in keyATM_fit(docs, model, no_keyword_topics, keywords,
## model_settings, : Some documents have 0 length. Please review the preprocessing
## steps.
## Initializing the model...
## Fitting the model. 1500 iterations...
## Creating an output object. It may take time...
top_words(vdgames_tmod_keyatm_base, 5)
##        1_dark      2_war 3_mysterious    Other_1 Other_2   Other_3
## 1   young [✓]        war     time [✓]       duty     add    marvel
## 2    land [✓]  years [✓]     back [✓]       call    york  universe
## 3   power [✓] mysterious     evil [✓]       star    rise    heroes
## 4        king       full  control [✓]       wars century  villains
## 5 journey [✓]    way [✓]        mario battle [2]    gang order [2]
kable(top_words(vdgames_tmod_keyatm_base, 5), caption = "Top 5 keywords")
Top 5 keywords

| 1_dark | 2_war | 3_mysterious | Other_1 | Other_2 | Other_3 |
|---|---|---|---|---|---|
| young [✓] | war | time [✓] | duty | add | marvel |
| land [✓] | years [✓] | back [✓] | call | york | universe |
| power [✓] | mysterious | evil [✓] | star | rise | heroes |
| king | full | control [✓] | wars | century | villains |
| journey [✓] | way [✓] | mario | battle [2] | gang | order [2] |

Conclusion

Thus, this approach can be used to extract similar results from video game descriptions and plots, and to see which themes are more popular. A similar exploratory text analysis could also be applied to other products.