License
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
Citation
Suggested citation: FIGUEIREDO, Adriano Marcos Rodrigues. Text Mining e Nuvem de palavras com R: Bíblia, Evangelho segundo Mateus. Campo Grande-MS, Brasil: RStudio/Rpubs, 2019. Available at http://rpubs.com/amrofi/word_cloud_with_R_Mateus.
Introduction
In this file I use a word cloud to see the most prominent words in the Gospel according to Matthew. The source file came from https://sites.google.com/site/biblialivre/arquivos. I pre-cleaned it by removing accents, cedillas, the header, the chapter names, and Portuguese punctuation marks.
Source text: Title: Bíblia Livre. Reviser: Diego Renato dos Santos. Source: http://sites.google.com/site/biblialivre/. License: Creative Commons Atribuição 3.0 Brasil. Where space is limited, the abbreviation BLIVRE may be used.
library(readr)
# Read the whole file into a single string (readChar() is base R)
words <- readChar("mat_limpo.txt", file.info("mat_limpo.txt")$size)
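As a minimal sketch of how this whole-file read behaves, the example below writes two verses to a temporary file and reads them back as a single string. The temporary file and the names sample.path and sample.text are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical stand-in for mat_limpo.txt: a temporary file with two verses.
sample.path <- tempfile(fileext = ".txt")
writeLines(c("1. Livro da geracao de Jesus Cristo.",
             "2. Abraao gerou a Isaque."),
           sample.path)

# file.info()$size is the file size in bytes; readChar() reads up to that
# many characters and returns them as one string.
sample.text <- readChar(sample.path, file.info(sample.path)$size)

length(sample.text)       # number of elements in the result
grepl("\n", sample.text)  # checks whether line breaks are kept in the string
```

Because the result is one long string rather than one element per line, the line breaks and verse numbers are still embedded in it and have to be cleaned later.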
Text mining
The file will be converted to a character vector for text mining.
str(words)
chr "1. Livro da geracao de Jesus Cristo, filho de Davi, filho de Abraao.\r\n2. Abraao gerou a Isaque; e Isaque g"| __truncated__
words <- as.character(words)
The tm package (text mining) will clean the text. First, I convert the text to a ‘Corpus’, that is, a collection of the text documents we will work with.
library(tm) #text mining
## Loading required package: NLP
library(dplyr)
word.corpus <- Corpus(VectorSource(words)) #Corpus
Here the terms need cleaning. This can be done with ‘tm_map()’, removing elements such as punctuation, extra whitespace, and numbers:
word.corpus <- word.corpus %>%
  tm_map(removePunctuation) %>%  # remove punctuation
  tm_map(removeNumbers) %>%      # remove numbers
  tm_map(stripWhitespace)        # collapse extra whitespace
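To make the effect of these three steps concrete, here is a minimal sketch that applies base-R approximations of them to a short sample string. The gsub() patterns are my assumption of what each tm transformation roughly does, and the variable names are illustrative; the analysis itself uses the tm functions.

```r
# Base-R approximations of the three tm transformations (an assumption for
# illustration only; the analysis above uses the tm functions).
sample.text <- "1. Livro da geracao de Jesus Cristo, filho de Davi.\r\n2. Abraao gerou a Isaque;"

no.punct   <- gsub("[[:punct:]]+", "", sample.text)  # like removePunctuation
no.numbers <- gsub("[[:digit:]]+", "", no.punct)     # like removeNumbers
clean.text <- gsub("[[:space:]]+", " ", no.numbers)  # like stripWhitespace

clean.text
```

Note that the last step collapses each run of whitespace (including the line breaks) into one space instead of deleting it, so the words remain separated.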
We also remove common words such as “e” or “do” and the other Portuguese stopwords.
word.corpus <- word.corpus %>%
  tm_map(tolower) %>%  # make all words lowercase
  tm_map(removeWords, stopwords("por"))
Or other words chosen by hand:
word.corpus <- tm_map(word.corpus, removeWords, c("nao", "porque", "entao",
"Ref.", "that", "with", "will", "also", "i'm"))
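The idea behind word removal can be sketched in base R with a word-boundary regular expression. The helper drop.words() below is hypothetical and only approximates what removeWords does; it is not the tm implementation.

```r
# Hypothetical helper: delete whole-word matches of the given words.
drop.words <- function(text, words) {
  # \\b anchors the match at word boundaries, so only whole words are removed.
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", text, perl = TRUE)
}

sample.text <- "jesus disse nao temas porque eu sou"
drop.words(sample.text, c("nao", "porque"))
```

In this sketch the match is literal and case-sensitive, so entries should be written as they appear in the cleaned text (lower case, no accents, no punctuation). The removed words leave extra spaces behind, which a whitespace-collapsing step can tidy up.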
Text stemming can be used to reduce multiple forms or derivations of the same word to a common stem.
word.corpus <- tm_map(word.corpus, stemDocument)
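Which stemmer is applied matters for Portuguese text, and the source does not state a language for stemDocument. As a rough sketch, assuming the SnowballC package is installed, the stems produced by two Snowball stemmers can be compared directly. The word list is illustrative ("filhos" does not come from the output in this document).

```r
library(SnowballC)  # assumed to be installed

# A few inflected forms; "filhos" is an illustrative addition.
forms <- c("disse", "dizendo", "filho", "filhos")

# wordStem() returns one stem per input word; the language argument selects
# which Snowball stemmer is used.
stems.pt <- wordStem(forms, language = "portuguese")
stems.en <- wordStem(forms, language = "english")

# Side-by-side comparison of the two stemmers.
data.frame(forms, stems.pt, stems.en)
```

Inspecting a table like this before building the term matrix shows whether related forms are actually merged into one term.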
Term Frequencies
Next, the word frequencies are computed from the cleaned text. This ranks the words and shows the most used terms.
word.counts <- as.matrix(TermDocumentMatrix(word.corpus))
word.freq <- sort(rowSums(word.counts), decreasing = TRUE)
head(word.freq) ##what are the top words?
  jesus    diss   filho dizendo  senhor   porem
    175     140     107     105      83      77
#  jesus    diss   filho dizendo  senhor   porem
#    176     140     107     105      85      77
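The counting step can be illustrated without tm. The sketch below, using a made-up cleaned sentence, splits the text into tokens and tabulates them; it mirrors the idea of summing the rows of the term-document matrix but is not the same code path.

```r
# Illustrative cleaned text (not taken from the corpus).
sample.text <- "abraao gerou isaque isaque gerou jaco jaco gerou juda"

# Split on runs of spaces, then count each distinct token.
tokens <- strsplit(sample.text, " +")[[1]]
sample.freq <- sort(table(tokens), decreasing = TRUE)

head(sample.freq, 3)  # the most frequent tokens first
```

With a single document, the row sums of the term-document matrix and a plain token table carry the same information: one count per term.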
Word cloud
library(wordcloud) #wordcloud
set.seed(32) #be sure to set the seed if you want to reproduce the same again
wordcloud(words = names(word.freq), freq = word.freq, scale = c(3, 0.5), max.words = 100,
random.order = TRUE)

The cloud can be customized in several ways. Change the scale to adjust the range of text sizes, and pass a palette to set the colors.
library(wesanderson)
wordcloud(words = names(word.freq), freq = word.freq, scale = c(4, 0.3), max.words = 100,
    random.order = TRUE, colors = wes_palette("Darjeeling1"))

The random.order option controls whether words are plotted in random order; with random.order = FALSE they are plotted in decreasing frequency, so the most frequent words end up near the center. The max.words option sets the maximum number of words to display. The rot.per option sets the proportion of words that are rotated by 90 degrees.
wordcloud(words = names(word.freq), freq = word.freq, scale = c(4, 0.3), max.words = 100,
    random.order = FALSE, colors = wes_palette("Darjeeling1"), rot.per = 0.7)
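For experimenting with these options without the full corpus, a small named vector is enough. The sketch below reuses the six counts printed earlier; the name toy.freq, the color choices, and the parameter values are assumptions for illustration, and the wordcloud package is assumed to be installed.

```r
library(wordcloud)  # assumed to be installed

# Six terms and counts copied from the frequency output shown earlier.
toy.freq <- c(jesus = 175, diss = 140, filho = 107,
              dizendo = 105, senhor = 83, porem = 77)

set.seed(32)  # layout and rotation are random, so fix the seed
wordcloud(words = names(toy.freq), freq = toy.freq,
          scale = c(3, 0.5),       # largest and smallest text size
          max.words = 6,           # show at most six words
          random.order = FALSE,    # plot in decreasing frequency
          rot.per = 0.25,          # about a quarter of the words rotated
          colors = c("grey50", "steelblue", "firebrick"))
```

Changing one argument at a time and re-running with the same seed makes it easier to see what each option contributes.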
