R for literary analysis in Portuguese

R code for distant reading of Portuguese-language literature (Giménez, 2024). This document updates and expands an earlier version developed by Diego Giménez and Andressa Gomide in 2022, which focused on the “Livro do Desassossego”. This new version analyzes works by Machado de Assis, including “Dom Casmurro”, “A mão e a luva”, “Memórias Póstumas de Brás Cubas”, and “Quincas Borba”.

1 Tools and data preparation

1.1 Installation

Quanteda (Quantitative Analysis of Textual Data) is an R package for managing and analyzing textual data.

Installing R varies with the operating system (e.g., Windows, Mac, Linux) and its version. Up-to-date installation instructions are available from several sources (e.g., https://didatica.tech/como-instalar-a-linguagem-r-e-o-rstudio/). The Comprehensive R Archive Network (CRAN), the official distribution network for R, provides reliable instructions, although they may be less detailed than those found elsewhere.

We also suggest installing a graphical user interface (GUI), which makes interacting with R considerably easier. RStudio is the most widely used GUI for R and, like R, it is free and open source.

1.2 Setup: preparing the environment

When reusing code, it is good practice to check the installed versions of both R and the libraries involved. The versions need not match those used when the code was written, but different versions are sometimes incompatible, and some functions or packages may have been discontinued. This article was written with R version 4.3.3.

# Check the R version

R.version.string

## [1] "R version 4.3.3 (2024-02-29 ucrt)"

Our analysis relies on several existing packages. Packages are simply extensions to R that usually contain data or code. To use them, we must install them on the computer, if that has not been done yet, and load them into R. Loading only the packages we need (rather than every installed package) avoids unnecessary computation. The code below lists the packages used in this analysis and loads them, installing any that are not yet present.

# List the packages we need

packages <- c("quanteda", # quantitative analysis of textual data
              "quanteda.textmodels", # complements quanteda with text-modeling functions
              "quanteda.textstats", # descriptive text statistics, such as frequencies, lexical diversity, and readability
              "quanteda.textplots", # visualizations of textual data, such as word clouds, lexical dispersion plots, and networks

              "newsmap", # classifies documents based on "seed words", i.e. predefined keywords that indicate topics or categories
              "readtext", # reads different text formats
              "spacyr", # part-of-speech tagging, named-entity recognition, and syntactic annotation (Python must be installed)
              "ggplot2", # simple frequency plots
              "seededlda", # topic modeling
              "stringr", # string manipulation and regular expressions: pattern matching, substring extraction, text replacement
              "dplyr",  # part of the tidyverse; manipulates tabular data (filtering, selecting, aggregating, joining)
              "tidytext", # text analysis following tidyverse data principles, so it fits into tidy pipelines
              "knitr", # dynamic documents that combine R code and results in Markdown, HTML, PDF, and other formats
              "igraph", # network analysis and visualization: creating, manipulating, and drawing graphs
              "topicmodels" # topic models such as LDA (Latent Dirichlet Allocation) and CTM (Correlated Topic Models)
              )

# Install (if necessary) and load the packages

package.check <- lapply(
  packages,
  FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
      install.packages(x, dependencies = TRUE)
      require(x, character.only = TRUE)
    }
  }
)

The code below was run with version 4.0.2 of Quanteda. Using a different version may lead to errors or unexpected results. To check a package's version, we use the function ‘packageVersion’; to check the R version, we use ‘R.version.string’.

# Check the quanteda version

packageVersion("quanteda")

## [1] '4.0.2'

Finally, we need to set our working directory, the location where results will be saved. To find the current working directory, we use getwd(), which returns the absolute path, that is, the full address of the directory. To set a new working directory, we use setwd(). Files stored in that directory can be read by file name alone, because we can then use a relative path, that is, the file's address starting from the directory we are working in.
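As a minimal sketch of the difference between absolute and relative paths, the base R snippet below switches the working directory, writes a file by name only, and then restores the original directory. The temporary directory and the file name exemplo.txt are assumptions made for illustration; they are not part of the analysis.

```r
# Sketch: inspect, change, and restore the working directory.
# tempdir() stands in for a real project folder (illustrative assumption).
old_wd <- getwd()                 # absolute path of the current working directory
setwd(tempdir())                  # move to another directory

# Write a file using only its name, i.e. a relative path
writeLines("texto de exemplo", "exemplo.txt")
found_relative <- file.exists("exemplo.txt")

# The same file addressed through its absolute path
absolute_path <- file.path(getwd(), "exemplo.txt")
found_absolute <- file.exists(absolute_path)

setwd(old_wd)                     # go back to the original directory
```

After the final setwd(), the relative name exemplo.txt no longer resolves to that file (unless the original directory happens to contain one), while absolute_path still does.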

1.3 Data

With the required packages installed, we can move on to analyzing the corpus. First we need to load the corpus into R. If the data are stored locally, that is, on the computer where the analyses will run, we simply call readtext() with the (relative or absolute) location of the desired file.

The book ‘Dom Casmurro’ can be read as a single file:

# Read a single file with the full content of the book

Dom_Casmurro <- readtext("~/corpora/Dom Casmurro.txt", encoding = "utf-8")

# Show the structure of the object we created

str(Dom_Casmurro)

## Classes 'readtext' and 'data.frame': 1 obs. of  2 variables:
##  $ doc_id: chr "Dom Casmurro.txt"
##  $ text  : chr "DOM CASMURRO\n\nPOR\n\nMACHADO DE ASSIS\n\nDA ACADEMIA BRAZILEIRA\n\nH. GARNIER, LIVREIRO-EDITOR\n\nRUA MOREIRA"| __truncated__

Alternatively, the book can be treated as one unit within a corpus made up of several documents:

# Read all the files in the machado folder of the corpora directory

Machado_files <- readtext("~/corpora/machado", encoding = "utf-8")

# Show the structure of the object we created

str(Machado_files)

## Classes 'readtext' and 'data.frame': 4 obs. of  2 variables:
##  $ doc_id: chr  "A mão e a luva.txt" "Dom Casmurro.txt" "Memórias Braz Cubas.txt" "Quincas Borba.txt"
##  $ text  : chr  "A MÃO\n\nE\n\nA LUVA\n\nDe\n\nMACHADO DE ASSIS\n\nda Academia Brasileira\n\nLivraria Garnier\n\n109, Rua do Ouv"| __truncated__ "DOM CASMURRO\n\nPOR\n\nMACHADO DE ASSIS\n\nDA ACADEMIA BRAZILEIRA\n\nH. GARNIER, LIVREIRO-EDITOR\n\nRUA MOREIRA"| __truncated__ "MEMÓRIAS PÓSTHUMAS\n\nDE\n\nBRAZ CUBAS\n\nPOR\n\nMACHADO DE ASSIS\n\nRIO DE JANEIRO\n\nTYPOGRAPHIA NACIONAL\n\n"| __truncated__ "QUINCAS BORBA\n\nDO\n\nMACHADO DE ASSIS\n\nRIO DE JANEIRO\n\nB. L. GARNIER, LIVREIRO-EDITOR\n\n71, Rua do Ouvid"| __truncated__

The texts above are works by Machado de Assis, available from both Project Gutenberg and the Biblioteca Digital de Literatura de Países Lusófonos (UFSC).

The files were saved in UTF-8 encoding, and paratextual and editorial information (such as editors' notes) that could interfere with the software's automated searches was removed.

The analyses below are demonstrated on both corpora, at different points.

1.3.1 Cleaning

The cleaning below was applied only to the texts saved as separate files (Machado_files). The file containing the book as a single text (Dom_Casmurro) had already been cleaned.

# Make a copy so that we can recover the original if the regex goes wrong

machado_clean <- Machado_files

## Remove unwanted elements

# Remove numbers at the start of lines (index); \\d+ matches the whole number, not just its first digit

machado_clean$text <- str_replace_all(machado_clean$text, "\\n\\d+", "\n")

# Remove dates (patterns such as 12-3-1930 or 12-III-1930)

machado_clean$text <- str_replace_all(machado_clean$text, "\\d{1,2}-(\\d{1,2}|[IVX]{1,4})-19\\d{2}", "")
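Before running a regex over a whole corpus, it can help to try it on a short string. The sketch below uses an invented sample text (not taken from the corpus) to contrast a pattern that matches a single digit after a line break with one that matches the whole number, and then applies the date pattern.

```r
library(stringr)

# Invented sample: a two-digit index at the start of a line and a date
sample_text <- "Capitulo\n12 Do titulo\nUma noite destas, 3-IV-1901, vindo da cidade"

# "\\n\\d" removes only the first digit after the line break ...
one_digit <- str_replace_all(sample_text, "\\n\\d", "\n")

# ... while "\\n\\d+" removes the whole number
all_digits <- str_replace_all(sample_text, "\\n\\d+", "\n")

# Date pattern: day, then month in Arabic or Roman numerals, then a year 19xx
no_dates <- str_replace_all(all_digits,
                            "\\d{1,2}-(\\d{1,2}|[IVX]{1,4})-19\\d{2}", "")

cat(one_digit, all_digits, no_dates, sep = "\n---\n")
```

Note that the date pattern only matches years beginning with 19; dates from other centuries would need an adjusted pattern.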

1.4 Exploring with Quanteda

Once the files are loaded, we need to create a “corpus” object, the format Quanteda requires to process the text(s) and report information about them. To do so, we simply apply the function corpus. When we inspect the corpus with the function summary, each text is automatically segmented into tokens and sentences, and we obtain the number of sentences, tokens, and types (the number of distinct tokens). Tokens are all occurrences (repetitions included) of words, as well as other items such as punctuation, numbers, and symbols.

# Create the corpus from several files

corpus_clean <- corpus(machado_clean)

# Show a summary of the corpus

summary(corpus_clean)

# Create the corpus from the single file

corpus_unico <- corpus(Dom_Casmurro)
summary(corpus_unico)

If necessary, we can change the structure of our corpus. corpus_unico is a corpus made of a single text. With corpus_reshape we can create a new corpus in which each sentence is treated as a text, that is, as a unit.

# Show the number of texts in the corpus

ndoc(corpus_unico)

## [1] 1

# Reshape the corpus so that each sentence becomes a unit

corpus_sents <- corpus_reshape(corpus_unico, to = "sentences")

# Show a summary of the corpus

summary(corpus_sents)

# Total number of units in the reshaped corpus

ndoc(corpus_sents)

## [1] 3836
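The reshaping step can be checked on a tiny corpus. The sketch below uses an invented three-sentence text (an assumption for illustration) to show that reshaping to sentences changes the number of units, and that the operation can be reversed.

```r
library(quanteda)

# Invented one-document corpus with three sentences
toy_corpus <- corpus(c(livro = "Era noite. Chovia muito. Ninguem falou."))
n_before <- ndoc(toy_corpus)

# Each sentence becomes its own unit
toy_sents <- corpus_reshape(toy_corpus, to = "sentences")
n_sentences <- ndoc(toy_sents)

# Reshape back to the original documents
toy_back <- corpus_reshape(toy_sents, to = "documents")
n_after <- ndoc(toy_back)

c(before = n_before, sentences = n_sentences, after = n_after)
```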

The examples above show that a corpus is a set of texts with information about each text (metadata), from which we can easily obtain token, type, and sentence counts per text. To run quantitative analyses on the corpus, however, we need to split the texts into tokens (tokenization). During tokenization we can also filter the tokens, removing elements such as punctuation, symbols, numbers, URLs, and separators.

# Tokenize our three corpora

toks_unico <- tokens(corpus_unico)
toks_sents <- tokens(corpus_sents)
toks_files <- tokens(corpus_clean)


## Below we filter the three corpora in different ways, for demonstration
# Remove punctuation (corpus cleaned with regex and single-file corpus)

toks_nopunct_files <- tokens(corpus_clean, remove_punct = TRUE)
toks_nopunct_unico <- tokens(corpus_unico, remove_punct = TRUE)

# Remove numbers (single-file corpus)

toks_nonumbr <- tokens(corpus_unico, remove_numbers = TRUE)

# Remove separators (Unicode "Separator" [Z] and "Control" [C] categories); TRUE is already the default (sentence-level corpus)

toks_nosept <- tokens(corpus_sents, remove_separators = TRUE)

# Remove several elements at once (single-file corpus)

toks_simples <- tokens(corpus_unico, remove_numbers = TRUE, remove_symbols = TRUE, remove_punct = TRUE)
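To see what these filters do, the sketch below tokenizes one invented sentence (not from the corpus) with and without filtering and compares the token counts.

```r
library(quanteda)

# Invented sentence containing words, numbers, and punctuation
toy_text <- c(d1 = "Bentinho tinha 15 annos; ella, 14!")

# Default tokenization keeps punctuation and numbers as tokens
toks_all <- tokens(toy_text)

# Filtered tokenization keeps only the words
toks_words <- tokens(toy_text, remove_punct = TRUE, remove_numbers = TRUE)

n_all <- ntoken(toks_all)[["d1"]]
n_words <- ntoken(toks_words)[["d1"]]
words <- as.character(toks_words)

print(toks_all)
print(toks_words)
```

The unfiltered version counts the semicolon, comma, exclamation mark, and both numbers as tokens, which is why token counts can differ considerably depending on the preparation of the corpus.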

Unwanted tokens can also be removed. Quanteda provides lists of ‘stopwords’ for several languages. Stopwords (palavras vazias in Portuguese) are words removed during text processing for computational analysis. There is no standard list, but stopwords are usually the most frequent words of a language, such as prepositions and articles. The block below removes the words on the Portuguese stopword list, together with other tokens that recur in this corpus.

# Remove stopwords from the single-file corpus

toks_nostop <- tokens_select(toks_unico, pattern = stopwords("pt"), selection = "remove")

# Tokens to remove: encoding debris, older spellings, and frequent function words, plus the Portuguese stopword list

unwanted_tokens <- c("nã", "£", "ã", "ha", "§", "©", "³", "á", "onde", "todo", "tão", "ter", "ella", "elle", "s", "é", "sã", "pã", "â", "jã", "tambem", "assim", "ia", "porque", "della", "delle", "tal", "ás", "lá", "d", "alguma", "alguns", stopwords("pt"))

# Remove the selected tokens from the multi-file corpus (cleaned with regex), after punctuation removal

toks_selected_files <- tokens_select(toks_nopunct_files, pattern = unwanted_tokens, selection = "remove")

# Remove the selected tokens from the single-file corpus, after punctuation removal

toks_selected_unico <- tokens_select(toks_nopunct_unico, pattern = unwanted_tokens, selection = "remove")
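The sketch below illustrates, on an invented phrase, why the standard list is extended: the older spellings ‘elle’ and ‘ella’ found in these editions are not on the Portuguese stopword list, so they survive the first filter and have to be added by hand.

```r
library(quanteda)

# Invented phrase mixing older spellings with ordinary function words
toy_toks <- tokens("elle viu a casa de ella")

# Standard Portuguese stopword list only
kept_default <- as.character(
  tokens_select(toy_toks, pattern = stopwords("pt"), selection = "remove")
)

# Standard list extended with the older spellings
kept_extended <- as.character(
  tokens_select(toy_toks,
                pattern = c(stopwords("pt"), "elle", "ella"),
                selection = "remove")
)

print(kept_default)
print(kept_extended)
```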

After tokenization, the next step is to build a table with the frequency of each token in each text or, in Quanteda's terms, a document-feature matrix (DFM). The DFM is a prerequisite for several other Quanteda functions, such as topfeatures, which returns the most frequent tokens in a corpus.

# The 20 most frequent words after removing numbers, symbols, and punctuation

dfm_simples <- dfm(toks_simples)
print("numbers, symbols, and punctuation removed")

## [1] "numbers, symbols, and punctuation removed"

topfeatures(dfm_simples, 20)

##  que    a    e   de    o  não   um    é   os   da   do  mas   se  era para  com 
## 2659 2488 2185 1954 1687 1529  775  708  666  626  617  609  568  554  543  537 
##   as   eu   me   em 
##  533  531  489  462

dfm_nostop <- dfm(toks_nostop)
print("stopwords removed")

## [1] "stopwords removed"

topfeatures(dfm_nostop, 20)

##      ,      .      -      ;      é      ? capitú      !      á   elle    mãe 
##   6864   4647   1680   1088    708    360    341    281    262    239    228 
##      :   dias tambem   tudo   ella   casa    ser  olhos    mim 
##    223    191    189    188    186    170    167    164    162

dfm_selected_unico <- dfm(toks_selected_unico)
print("selected tokens and stopwords removed from the single-file corpus")

## [1] "selected tokens and stopwords removed from the single-file corpus"

topfeatures(dfm_selected_unico, 20)

##  capitú     mãe    dias    tudo    casa     ser   olhos     mim    josé     vez 
##     341     228     191     188     170     167     164     162     160     148 
##   agora   ainda   outra    nada   disse   tempo   padre     dia escobar   outro 
##     146     140     138     134     121     120     112     110     110     106

dfm_selected_files <- dfm(toks_selected_files)
print("selected tokens and stopwords removed from the corpus previously cleaned with regex")

## [1] "selected tokens and stopwords removed from the corpus previously cleaned with regex"

topfeatures(dfm_selected_files, 20)

##   rubião    olhos     casa     tudo     nada      ser    outra    disse 
##      699      607      546      532      511      508      500      492 
##    ainda capitulo    cousa    tempo    outro      vez    agora    homem 
##      490      464      461      436      431      415      382      366 
##      bem    pouco     dias    podia 
##      357      356      352      345
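To make the structure of a DFM concrete, the sketch below builds one from two invented mini-documents (assumptions for illustration): rows are documents, columns are features, and each cell is a count. topfeatures then sums the counts over all documents.

```r
library(quanteda)

# Two invented mini-documents
toy_toks <- tokens(c(d1 = "casa olhos casa", d2 = "olhos casa mar"))

# Rows are documents, columns are features, cells are counts
toy_dfm <- dfm(toy_toks)
counts <- as.matrix(toy_dfm)

# Most frequent features across the whole toy corpus
top2 <- topfeatures(toy_dfm, 2)

print(counts)
print(top2)
```

Here ‘casa’ occurs twice in d1 and once in d2, so it heads the list with a total of 3.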

Once we have the list of tokens, we can explore the corpus. One of the simplest and most widely used techniques for corpus exploration is the concordance line, also known as keywords in context (kwic). Concordance lines show the fragments of the corpus in which the search terms occur. The user can set the number of context words; the default is 5 tokens to the left and 5 to the right. The first column gives the name of the file in which the search term occurs. Searches can be run in several ways: by words or by fragments, sequences, or combinations of these.

# Occurrences of words beginning with "feli"

kwic(toks_unico, pattern =  "feli*")

# We can also search for more than one word at a time

kwic(toks_unico, pattern = c("feli*", "alegr*"))

# And for sequences of more than one token

kwic(toks_unico, pattern = phrase("me fal*"))
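The size of the context is set with the window argument of kwic. The sketch below runs two searches on an invented sentence (not from the corpus) with a window of 2 tokens, and converts the result to a data frame so that its columns can be inspected.

```r
library(quanteda)

# Invented sentence containing two words that begin with "alegr"
toy_toks <- tokens("Fiquei muito alegre com a ideia, mas a alegria durou pouco.")

# Glob search with 2 tokens of context on each side
hits <- as.data.frame(kwic(toy_toks, pattern = "alegr*", window = 2))

# Multi-token search: phrase() makes the pattern match a sequence of tokens
hits_phrase <- as.data.frame(
  kwic(toy_toks, pattern = phrase("a alegr*"), window = 2)
)

print(hits[, c("pre", "keyword", "post")])
print(hits_phrase[, c("pre", "keyword", "post")])
```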

1.4.1 N-grams

Word frequency lists can help identify the common elements of a text. In many cases, however, it is just as important to know the context in which those words appear. Identifying which words frequently co-occur in a corpus can tell us even more about the text. For example, knowing that the sequence ‘estou triste’ (“I am sad”) is frequent in the corpus is more informative than the frequency of ‘triste’ alone. The sequence ‘estou triste’ is an example of what we call an n-gram or, in this specific case, a bigram. N-grams are sequences of two or more words that occur in a text. To generate lists of n-grams, we start from a list of tokens and specify the sizes of the n-grams we want (below, from 2 to 4 tokens).

# Create a list of 2-grams, 3-grams, and 4-grams

toks_ngram <- tokens_ngrams(toks_simples, n = 2:4)

# Show only the first 30 n-grams of the first document (in order of appearance, not by frequency)
head(toks_ngram[[1]], 30)

##  [1] "DOM_CASMURRO"            "CASMURRO_POR"           
##  [3] "POR_MACHADO"             "MACHADO_DE"             
##  [5] "DE_ASSIS"                "ASSIS_DA"               
##  [7] "DA_ACADEMIA"             "ACADEMIA_BRAZILEIRA"    
##  [9] "BRAZILEIRA_H"            "H_GARNIER"              
## [11] "GARNIER_LIVREIRO-EDITOR" "LIVREIRO-EDITOR_RUA"    
## [13] "RUA_MOREIRA"             "MOREIRA_CEZAR"          
## [15] "CEZAR_RIO"               "RIO_DE"                 
## [17] "DE_JANEIRO"              "JANEIRO_RUE"            
## [19] "RUE_DES"                 "DES_SAINTS-PÈRES"       
## [21] "SAINTS-PÈRES_PARIZ"      "PARIZ_I"                
## [23] "I_Do"                    "Do_titulo"              
## [25] "titulo_Uma"              "Uma_noite"              
## [27] "noite_destas"            "destas_vindo"           
## [29] "vindo_da"                "da_cidade"
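The head() call above lists n-grams in the order in which they appear. To rank n-grams by frequency instead, one option is to build a DFM from the n-gram tokens and call topfeatures on it. The sketch below does this for bigrams of an invented sentence (an assumption for illustration).

```r
library(quanteda)

# Invented sentence in which one bigram is repeated
toy_toks <- tokens("o padre disse que o padre sabia")

# Bigrams only; tokens are joined with "_" by default
toy_bigrams <- tokens_ngrams(toy_toks, n = 2)

# Counting through a DFM gives the bigrams ranked by frequency
bigram_dfm <- dfm(toy_bigrams)
top_bigram <- topfeatures(bigram_dfm, 1)

print(as.character(toy_bigrams))
print(top_bigram)
```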

1.4.2 Dictionary

Another way to extract information from a text is to create “dictionaries”. The dictionary function in Quanteda groups tokens into categories, which can then be used to search the corpus. For example, we can create the categories “alegria” (joy) and “tristeza” (sadness), each containing words related to that feeling. With the dictionary in place, we can see how these terms are distributed across a corpus.

# Create a dictionary and apply it to the single-document corpus

dict <- dictionary(list(alegria = c("alegr*", "allegr*", "feli*", "content*"),
                        tristeza = c("trist*", "infeli*")))

dict_toks <- tokens_lookup(toks_unico, dictionary = dict)
print(dict_toks)

## Tokens consisting of 1 document.
## Dom Casmurro.txt :
##  [1] "alegria"  "alegria"  "alegria"  "alegria"  "tristeza" "alegria" 
##  [7] "alegria"  "alegria"  "alegria"  "alegria"  "alegria"  "alegria" 
## [ ... and 86 more ]

dfm(dict_toks)

## Document-feature matrix of: 1 document, 2 features (0.00% sparse) and 0 docvars.
##                   features
## docs               alegria tristeza
##   Dom Casmurro.txt      74       24

# Create the same dictionary and apply it to the multi-document corpus

dict <- dictionary(list(alegria = c("alegr*", "allegr*", "feli*", "content*"),
                        tristeza = c("trist*", "infeli*")))

dict_toks <- tokens_lookup(toks_files, dictionary = dict)
print(dict_toks)

## Tokens consisting of 4 documents.
## A mão e a luva.txt :
##  [1] "tristeza" "alegria"  "tristeza" "alegria"  "alegria"  "tristeza"
##  [7] "alegria"  "tristeza" "tristeza" "tristeza" "alegria"  "alegria" 
## [ ... and 95 more ]
## 
## Dom Casmurro.txt :
##  [1] "alegria"  "alegria"  "alegria"  "alegria"  "tristeza" "alegria" 
##  [7] "alegria"  "alegria"  "alegria"  "alegria"  "alegria"  "alegria" 
## [ ... and 86 more ]
## 
## Memórias Braz Cubas.txt :
##  [1] "tristeza" "tristeza" "alegria"  "tristeza" "tristeza" "alegria" 
##  [7] "alegria"  "tristeza" "alegria"  "alegria"  "alegria"  "alegria" 
## [ ... and 145 more ]
## 
## Quincas Borba.txt :
##  [1] "alegria"  "alegria"  "alegria"  "alegria"  "alegria"  "tristeza"
##  [7] "alegria"  "alegria"  "tristeza" "alegria"  "alegria"  "alegria" 
## [ ... and 116 more ]

dfm(dict_toks)

## Document-feature matrix of: 4 documents, 2 features (0.00% sparse) and 0 docvars.
##                          features
## docs                      alegria tristeza
##   A mão e a luva.txt           75       32
##   Dom Casmurro.txt             74       24
##   Memórias Braz Cubas.txt      86       71
##   Quincas Borba.txt            97       31
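Raw dictionary counts depend on the length of each text, so longer books tend to show higher totals. One possible way to make documents comparable, sketched below on two invented mini-documents, is to express the counts per 1,000 tokens. This normalization is an illustrative assumption and is not part of the analysis above.

```r
library(quanteda)

# Shortened version of the dictionary, for illustration
toy_dict <- dictionary(list(alegria = c("alegr*", "feli*"),
                            tristeza = c("trist*", "infeli*")))

# Two invented mini-documents of different lengths
toy_toks <- tokens(c(d1 = "alegre e feliz mas triste",
                     d2 = "infeliz e tristeza"))

# Raw counts per category and document
toy_counts <- as.matrix(dfm(tokens_lookup(toy_toks, dictionary = toy_dict)))

# Counts per 1,000 tokens of each document
toy_rates <- toy_counts / ntoken(toy_toks) * 1000

print(toy_counts)
print(round(toy_rates))
```

Note that ‘infeliz’ is counted under tristeza only: the glob pattern "feli*" matches tokens that begin with "feli", so it does not match inside ‘infeliz’.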

2 Data visualization and analysis

2.1 Word cloud and frequency plot

In 1.4, we created a DFM with token frequencies. Visualizations let us take in those frequencies more quickly. One option is the word cloud, a plot that gives a quick view of the most frequent terms.

# Demonstration of how word frequencies change with the preparation of the corpus

set.seed(100) # for reproducible results
textplot_wordcloud(dfm_selected_unico, min_count = 6, random_order = FALSE, rotation = .25, color = RColorBrewer::brewer.pal(8, "Dark2"))

set.seed(100) 
textplot_wordcloud(dfm_selected_files, min_count = 6, random_order = FALSE, rotation = .25, color = RColorBrewer::brewer.pal(8, "Dark2"))

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## depressa could not be fit on page. It will not be plotted.

## [Further warnings of the same form omitted, one for each word that could not be fit on the page.]

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## expressão could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## repetiu could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## capaz could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## sonhos could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## bocca could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## esperança could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## alegria could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## adeus could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## particular could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## podiam could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, : pôde
## could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## começou could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## cachorro could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## daqui could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## nomes could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## estado could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## saudades could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## passeio could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, : cedo
## could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## gabinete could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## cabellos could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, : seis
## could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## nesse could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## especie could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## almoço could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, : moço
## could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## olhando could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## olhava could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## missa could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## maneira could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## unico could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## vieram could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## ouvia could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## imaginação could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## exclamou could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## edade could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## vestido could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## motivo could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, : boca
## could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## vinham could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## comsigo could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## cadeira could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## interrompeu could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## voltou could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## alegre could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## perder could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## passos could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## pudesse could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, : obra
## could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## pensou could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## inteiramente could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## retrato could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## gostava could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## creatura could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## perto could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## anterior could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, : doce
## could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## impossivel could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## prazer could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## ministro could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## padua could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## papae could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## cocheiro could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## fallar could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## rindo could not be fit on page. It will not be plotted.

## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, : pude
## could not be fit on page. It will not be plotted.

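# Fix the random seed so that the word layout is reproducible
set.seed(100)
# Draw the word cloud keeping only features that occur at least 6 times
textplot_wordcloud(dfm_nostop, min_count = 6, random_order = FALSE, rotation = .25, color = RColorBrewer::brewer.pal(8, "Dark2"))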

Another solution is to use the ggplot2 package and plot the number of occurrences of the most frequent words.

# From the corpus made up of a single document

dfm_selected_unico %>% 
  textstat_frequency(n = 20) %>% 
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL, y = "Frequência") +
  theme_minimal()

# From a corpus made up of several documents

dfm_selected_files %>% 
  textstat_frequency(n = 20) %>% 
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL, y = "Frequência") +
  theme_minimal()
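To see what the plots above are built from, the following minimal sketch runs `textstat_frequency()` on a toy corpus. The two short "documents" are made up for illustration and are not taken from the novels; the sketch assumes the `quanteda` and `quanteda.textstats` packages are installed.

```r
library(quanteda)
library(quanteda.textstats)

# Toy corpus: two made-up "documents" (an assumption for illustration only)
toy <- c(d1 = "olhos de ressaca olhos", d2 = "casa velha casa olhos")
toy_dfm <- dfm(tokens(toy))

# Keep the two most frequent features, as n = 20 keeps the top twenty above
freq <- textstat_frequency(toy_dfm, n = 2)

# frequency = total occurrences; docfreq = number of documents containing the feature
freq[, c("feature", "frequency", "docfreq")]
```

The `feature` and `frequency` columns are the ones mapped to the axes in the `ggplot()` calls; `docfreq` can help tell words that are frequent everywhere from words concentrated in a single document.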

2.2 Topic modeling (LDA)

Another technique frequently used in Natural Language Processing (NLP) is topic modeling (TM). Topic modeling fits a statistical model that tries to capture the structure of the corpus by identifying and grouping words that tend to occur together. TM is an unsupervised or semi-supervised technique: the program learns to recognise patterns in the data without the need for prior annotation. The code below applies the Latent Dirichlet Allocation (LDA) model.

# Topic modeling from the corpus made up of a single document
# (no seed is set, so the topics may differ from one run to the next)
lda <- LDA(dfm_selected_unico, k = 10)
terms(lda, 10)

##       Topic 1  Topic 2   Topic 3 Topic 4  Topic 5 Topic 6    Topic 7  Topic 8 
##  [1,] "ser"    "escobar" "mãe"   "capitú" "casa"  "capitú"   "capitú" "capitú"
##  [2,] "olhos"  "dias"    "outra" "tudo"   "mãe"   "ser"      "mãe"    "olhos" 
##  [3,] "capitú" "nada"    "dias"  "casa"   "então" "olhos"    "mim"    "dias"  
##  [4,] "outros" "tempo"   "tempo" "dias"   "josé"  "casa"     "olhos"  "tudo"  
##  [5,] "vez"    "olhos"   "josé"  "agora"  "nada"  "mãe"      "vez"    "josé"  
##  [6,] "sim"    "capitú"  "ainda" "vez"    "ainda" "mim"      "outra"  "nada"  
##  [7,] "dia"    "vida"    "dia"   "olhos"  "dias"  "ainda"    "josé"   "padre" 
##  [8,] "padre"  "vez"     "então" "mim"    "outro" "padre"    "disse"  "pae"   
##  [9,] "pois"   "mal"     "pae"   "ainda"  "tudo"  "escobar"  "agora"  "outra" 
## [10,] "ainda"  "tudo"    "vez"   "mãe"    "aqui"  "palavras" "dias"   "póde"  
##       Topic 9  Topic 10  
##  [1,] "capitú" "tudo"    
##  [2,] "mãe"    "casa"    
##  [3,] "ser"    "vez"     
##  [4,] "dias"   "ser"     
##  [5,] "tudo"   "mãe"     
##  [6,] "nada"   "escobar" 
##  [7,] "agora"  "dias"    
##  [8,] "disse"  "tempo"   
##  [9,] "tempo"  "disse"   
## [10,] "ainda"  "palavras"

# Topic modeling from a corpus made up of several documents
lda <- LDA(dfm_selected_files, k = 10)
terms(lda, 10)

##       Topic 1   Topic 2  Topic 3  Topic 4    Topic 5    Topic 6    Topic 7
##  [1,] "capitú"  "outra"  "rubião" "rubião"   "olhos"    "virgilia" "mim"  
##  [2,] "tudo"    "disse"  "palha"  "ainda"    "si"       "capitulo" "cousa"
##  [3,] "casa"    "talvez" "sophia" "bem"      "nada"     "cousa"    "vez"  
##  [4,] "mãe"     "tudo"   "rua"    "homem"    "outra"    "olhos"    "menos"
##  [5,] "dias"    "nada"   "maria"  "casa"     "tudo"     "menos"    "ser"  
##  [6,] "josé"    "vez"    "outro"  "olhos"    "podia"    "tempo"    "dia"  
##  [7,] "mim"     "ainda"  "casa"   "tudo"     "cousa"    "pouco"    "creio"
##  [8,] "olhos"   "casa"   "ser"    "capitulo" "ir"       "outro"    "filho"
##  [9,] "ser"     "melhor" "outra"  "sophia"   "fez"      "mim"      "disse"
## [10,] "escobar" "sim"    "si"     "então"    "baroneza" "ser"      "outra"
##       Topic 8    Topic 9  Topic 10  
##  [1,] "guiomar"  "mãe"    "rubião"  
##  [2,] "estevão"  "outro"  "sophia"  
##  [3,] "baroneza" "agora"  "capitulo"
##  [4,] "luiz"     "outra"  "maria"   
##  [5,] "alves"    "dizer"  "nada"    
##  [6,] "coração"  "podia"  "disse"   
##  [7,] "moça"     "capitú" "outra"   
##  [8,] "disse"    "ir"     "marido"  
##  [9,] "jorge"    "ainda"  "olhos"   
## [10,] "mrs"      "dia"    "amigo"
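Because LDA starts from a random initialisation, the topics listed above can change between runs. The sketch below shows one way to make a run repeatable and to inspect which topic dominates each document. It assumes the `topicmodels` package provides `LDA()`, as in the calls above, and it uses a made-up toy corpus; the explicit `convert()` step and the seed value are choices made for this illustration, not part of the original analysis.

```r
library(quanteda)
library(topicmodels)

# Toy corpus with two loosely separated vocabularies (made up for illustration)
toy <- c(
  d1 = "olhos ressaca mar olhos mar",
  d2 = "mar olhos ressaca mar",
  d3 = "casa rua cidade casa rua",
  d4 = "rua cidade casa cidade"
)

# Convert the quanteda dfm to the document-term format used by topicmodels
toy_dtm <- convert(dfm(tokens(toy)), to = "topicmodels")

# Fitting twice with the same seed in the control list should give the same model
lda_a <- LDA(toy_dtm, k = 2, control = list(seed = 100))
lda_b <- LDA(toy_dtm, k = 2, control = list(seed = 100))
identical(terms(lda_a, 3), terms(lda_b, 3))

# Most likely topic for each document
topics(lda_a)
```

With real data, the same `control = list(seed = ...)` argument could be added to the `LDA()` calls above so that the printed topic tables can be reproduced.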

2.3 Semantic Network

A feature co-occurrence matrix (FCM) is similar to a DFM, but instead of counting features per document it records how often pairs of features occur together. Plotting it produces a graph of the semantic network.

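# Network from the corpus made up of a single document

# Build the FCM from the DFM
fcm_nostop <- fcm(dfm_selected_unico)

# List the 50 most frequent features
feat <- names(topfeatures(dfm_selected_unico, 50))

# Keep only those features in the FCM
# (stored under a new name so that the fcm_select() function is not masked)
fcm_top <- fcm_select(fcm_nostop, pattern = feat, selection = "keep")

# Vertex size: log of each selected feature's total frequency
size <- log(colSums(dfm_select(dfm_selected_unico, feat, selection = "keep")))

# Plot the network, scaling vertex sizes relative to the largest one
textplot_network(fcm_top, min_freq = 0.8, vertex_size = size / max(size) * 3)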

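# Network from a corpus made up of several documents

# Build the FCM from the DFM
fcm_nostop <- fcm(dfm_selected_files)

# List the 50 most frequent features
feat <- names(topfeatures(dfm_selected_files, 50))

# Keep only those features in the FCM
# (stored under a new name so that the fcm_select() function is not masked)
fcm_top <- fcm_select(fcm_nostop, pattern = feat, selection = "keep")

# Vertex size: log of each selected feature's total frequency
size <- log(colSums(dfm_select(dfm_selected_files, feat, selection = "keep")))

# Plot the network, scaling vertex sizes relative to the largest one
textplot_network(fcm_top, min_freq = 0.8, vertex_size = size / max(size) * 3)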
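To make concrete what a co-occurrence matrix records, here is a base-R sketch on a made-up document-feature table. It counts, for each pair of features, the number of documents in which both appear. This is a simplified presence/absence illustration of the idea, not quanteda's implementation; the counts returned by `fcm()` depend on its own arguments and may differ.

```r
# Toy document-feature counts (rows = documents, columns = features); values are made up
toy_counts <- matrix(
  c(2, 1, 0,
    1, 0, 1,
    1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("olhos", "casa", "tempo"))
)

# Reduce counts to presence (1) or absence (0) of each feature in each document
presence <- (toy_counts > 0) * 1

# crossprod(presence) is t(presence) %*% presence:
# entry [i, j] is the number of documents that contain both feature i and feature j
cooc <- crossprod(presence)

# Ignore the co-occurrence of a feature with itself
diag(cooc) <- 0
cooc
```

In a network plot, each non-zero off-diagonal entry corresponds to an edge between two words, and larger values correspond to stronger links.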

Data and repository

The data and code are available on GitHub: https://github.com/DiegoEGimenez/R_literatura_Quanteda

The code can be viewed at https://rpubs.com/DiegoEGimenez/1191458

Acknowledgements

This document (2024) revises and extends code originally prepared by Diego Giménez and Andressa Gomide in 2022 for the analysis of the “Livro do Desassossego”. Some of the code described in the 2022 document drew on code kindly shared by Mark Alfano, used in his work “Nietzsche corpus analysis”.