Load the required packages:

library(dplyr)
library(aRxiv)
library(tm)
library(wordcloud)

arXiv is a repository of scientific papers.
The R package aRxiv retrieves some of the archive's data through the arXiv API.

Let's retrieve up to 200 papers that contain the term Stereotype:

Papers <- arxiv_search(query = '"Stereotype"', limit = 200)

This is the abstract of one of the retrieved papers:

head(Papers$abstract, 1)
[1] "  It is argued that colour name strategy, object name strategy, and chunking\nstrategy in memory are all aspects of the same general phenomena, called\nstereotyping. It is pointed out that the Berlin-Kay universal partial ordering\nof colours and the frequency of traffic accidents classified by colour are\nsurprisingly similar. Some consequences of the existence of a name strategy for\nthe philosophy of language and mathematics are discussed. It is argued that\nreal valued quantities occur {\\it ab initio}. The implication of real valued\ntruth quantities is that the {\\bf Continuum Hypothesis} of pure mathematics is\nside-stepped. The existence of name strategy shows that thought/sememes and\ntalk/phonemes can be separate, and this vindicates the assumption of thought\noccurring before talk used in psycholinguistic speech production models.\n"

To extract information from the abstracts, we first need to build what is called a text corpus:

Corpus <- with(Papers, VCorpus(VectorSource(abstract)))
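
As a rough illustration of what the corpus holds, here is a minimal sketch on two made-up sentences (they stand in for Papers$abstract and are not arXiv data). Each element of the corpus is one document, stored together with its metadata:

```r
library(tm)

# Toy abstracts standing in for Papers$abstract (made-up text, not arXiv data)
toy_abstracts <- c(
  "Stereotypes shape how people judge others.",
  "We measure gender stereotypes in 2 large text collections."
)

# One document per element of the character vector
toy_corpus <- VCorpus(VectorSource(toy_abstracts))

print(length(toy_corpus))             # number of documents in the corpus
print(as.character(toy_corpus[[2]]))  # text of the second document
print(meta(toy_corpus[[2]], "id"))    # document-level metadata kept next to the text
```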

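Now we clean the text: we strip numbers and punctuation, convert everything to lowercase, remove English stop words (very common words such as "the" or "and"), and finally collapse the extra whitespace left behind by the removals:

Corpus <- Corpus %>%
  tm_map(removeNumbers) %>%                     # drop digits
  tm_map(removePunctuation) %>%                 # drop punctuation marks
  tm_map(content_transformer(tolower)) %>%      # lowercase, so stop words match
  tm_map(removeWords, stopwords("english")) %>% # drop English stop words
  tm_map(stripWhitespace)                       # collapse leftover runs of spaces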

After cleaning, the first abstract looks like this:

strwrap(as.character(Corpus[[1]]))
 [1] "argued colour name strategy object name strategy chunking strategy"
 [2] "memory aspects general phenomena called stereotyping pointed"      
 [3] "berlinkay universal partial ordering colours frequency traffic"    
 [4] "accidents classified colour surprisingly similar consequences"     
 [5] "existence name strategy philosophy language mathematics discussed" 
 [6] "argued real valued quantities occur ab initio implication real"    
 [7] "valued truth quantities bf continuum hypothesis pure mathematics"  
 [8] "sidestepped existence name strategy shows thoughtsememes"          
 [9] "talkphonemes can separate vindicates assumption thought occurring" 
[10] "talk used psycholinguistic speech production models"               
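
Note that terms such as "berlinkay", "thoughtsememes" and "talkphonemes" in the output above are fused words: removing punctuation deletes the hyphen or slash without leaving a space. One possible workaround, sketched here in base R on a made-up sentence, is to turn those characters into spaces before punctuation is removed. The helper name split_compounds is hypothetical and not part of the original analysis:

```r
# Hypothetical helper (not in the original post): replace hyphens and
# slashes with a space so that compound terms stay as separate words
split_compounds <- function(x) gsub("[-/]", " ", x)

# Made-up sentence reusing the compounds seen in the first abstract
example <- "The Berlin-Kay ordering suggests thought/sememes and talk/phonemes differ"
cleaned <- tolower(split_compounds(example))
print(cleaned)

# With tm loaded, the same helper could be applied to the corpus before
# removePunctuation, for example:
# Corpus <- tm_map(Corpus, content_transformer(split_compounds))
```

Whether splitting is preferable depends on the analysis: "berlin" and "kay" counted separately may be no more informative than "berlinkay".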

Let's create the word cloud (the Set2 palette provides at most 8 colours):

wordcloud(Corpus, max.words = 100, scale = c(8,1),
          colors = brewer.pal(8, "Set2"), random.color = TRUE)
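
The size of each word in the cloud reflects how often it occurs across the documents. The sketch below shows that counting step in base R on two made-up, already-cleaned texts (they are assumptions for illustration, not the arXiv abstracts); when wordcloud() is given a corpus, it derives comparable counts internally.

```r
# Made-up, already-cleaned texts standing in for the processed abstracts
cleaned_texts <- c(
  "name strategy colour name strategy",
  "name strategy language mathematics"
)

# Split on whitespace and count how often each term occurs
tokens <- unlist(strsplit(cleaned_texts, "\\s+"))
term_freq <- sort(table(tokens), decreasing = TRUE)

print(term_freq)

# The counts could also be passed to wordcloud() explicitly:
# wordcloud(names(term_freq), as.vector(term_freq), min.freq = 1)
```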

Let's try the word “Gender”:

Papers <- arxiv_search(query = '"Gender"', limit = 200)
Corpus <- with(Papers, VCorpus(VectorSource(abstract)))
Corpus <- Corpus %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stripWhitespace)
# Set1 provides at most 9 colours
wordcloud(Corpus, max.words = 100, scale = c(8,1),
          colors = brewer.pal(9, "Set1"), random.color = TRUE)

Or with the phrase “Data Science”:

Papers <- arxiv_search(query = '"Data Science"', limit = 200)
Corpus <- with(Papers, VCorpus(VectorSource(abstract)))
Corpus <- Corpus %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stripWhitespace)
# Set3 provides at most 12 colours
wordcloud(Corpus, max.words = 100, scale = c(8,1),
          colors = brewer.pal(12, "Set3"), random.color = TRUE)
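
Since the same retrieve, clean and plot sequence is repeated for every query, it could be wrapped in a function. The following is only a sketch: the name cloud_for_query and its arguments are hypothetical, and calling it assumes that the aRxiv, tm, wordcloud and RColorBrewer packages are installed and that the arXiv API is reachable.

```r
# Hypothetical helper (not in the original post): search arXiv for a term,
# clean the abstracts and draw a word cloud. Functions are called with
# pkg::name so the definition itself does not need library() calls.
cloud_for_query <- function(term, palette = "Set2", n_colors = 8, limit = 200) {
  # Wrap the term in double quotes, as in the queries used above
  papers <- aRxiv::arxiv_search(query = paste0('"', term, '"'), limit = limit)

  corpus <- tm::VCorpus(tm::VectorSource(papers$abstract))
  corpus <- tm::tm_map(corpus, tm::removeNumbers)
  corpus <- tm::tm_map(corpus, tm::removePunctuation)
  corpus <- tm::tm_map(corpus, tm::content_transformer(tolower))
  corpus <- tm::tm_map(corpus, tm::removeWords, tm::stopwords("english"))
  corpus <- tm::tm_map(corpus, tm::stripWhitespace)

  # n_colors must not exceed the size of the chosen palette
  wordcloud::wordcloud(corpus, max.words = 100, scale = c(8, 1),
                       colors = RColorBrewer::brewer.pal(n_colors, palette),
                       random.color = TRUE)

  invisible(corpus)  # return the cleaned corpus for further inspection
}

# Example calls (need network access, so they are left commented out):
# cloud_for_query("Gender", palette = "Set1", n_colors = 9)
# cloud_for_query("Data Science", palette = "Set3", n_colors = 12)
```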

The same procedure can be applied to other text sources. For simplicity, I used the arXiv archive here.