library(dplyr)
library(aRxiv)
library(tm)
library(wordcloud)
The arXiv archive is a repository of scientific papers.
The R package aRxiv makes it possible to extract some data from the archive through its API.
Let us extract up to 200 papers that contain the term Stereotype:
Papers <- arxiv_search(query = '"Stereotype"', limit = 200)
head(Papers$abstract, 1)
[1] " It is argued that colour name strategy, object name strategy, and chunking\nstrategy in memory are all aspects of the same general phenomena, called\nstereotyping. It is pointed out that the Berlin-Kay universal partial ordering\nof colours and the frequency of traffic accidents classified by colour are\nsurprisingly similar. Some consequences of the existence of a name strategy for\nthe philosophy of language and mathematics are discussed. It is argued that\nreal valued quantities occur {\\it ab initio}. The implication of real valued\ntruth quantities is that the {\\bf Continuum Hypothesis} of pure mathematics is\nside-stepped. The existence of name strategy shows that thought/sememes and\ntalk/phonemes can be separate, and this vindicates the assumption of thought\noccurring before talk used in psycholinguistic speech production models.\n"
To extract information from the abstracts, we first need to build what is called a text corpus:
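# Build a corpus with one document per abstract
Corpus <- with(Papers, VCorpus(VectorSource(abstract)))
# Clean the text: collapse whitespace, drop digits and punctuation,
# lowercase everything, then remove common English stop words
Corpus <- Corpus %>%
  tm_map(stripWhitespace) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("english"))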
After cleaning, the abstracts look like this:
strwrap(as.character(Corpus[[1]]))
[1] "argued colour name strategy object name strategy chunking strategy"
[2] "memory aspects general phenomena called stereotyping pointed"
[3] "berlinkay universal partial ordering colours frequency traffic"
[4] "accidents classified colour surprisingly similar consequences"
[5] "existence name strategy philosophy language mathematics discussed"
[6] "argued real valued quantities occur ab initio implication real"
[7] "valued truth quantities bf continuum hypothesis pure mathematics"
[8] "sidestepped existence name strategy shows thoughtsememes"
[9] "talkphonemes can separate vindicates assumption thought occurring"
[10] "talk used psycholinguistic speech production models"
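A word cloud sizes each word by how often it occurs, so it can help to look at those counts directly. The following is a minimal sketch, assuming a small set of made-up abstracts (`toy_abstracts` is hypothetical and stands in for `Papers$abstract`, so the example runs without calling the arXiv API). It applies the same cleaning steps and then builds a term-document matrix with tm's `TermDocumentMatrix()`.

```r
library(tm)

# Hypothetical toy abstracts standing in for Papers$abstract
toy_abstracts <- c(
  "Colour name strategy and object name strategy are aspects of stereotyping.",
  "The name strategy has consequences for the philosophy of language.",
  "Stereotyping appears in memory and in language."
)

# Same cleaning steps as above, written without the pipe
toy_corpus <- VCorpus(VectorSource(toy_abstracts))
toy_corpus <- tm_map(toy_corpus, stripWhitespace)
toy_corpus <- tm_map(toy_corpus, removeNumbers)
toy_corpus <- tm_map(toy_corpus, removePunctuation)
toy_corpus <- tm_map(toy_corpus, content_transformer(tolower))
toy_corpus <- tm_map(toy_corpus, removeWords, stopwords("english"))

# Rows are terms, columns are documents
tdm <- TermDocumentMatrix(toy_corpus)

# Sum each term's counts over all documents, most frequent first
term_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(term_freq, 5)
```

On the real corpus, the same two last steps applied to `Corpus` would give a ranked list of the words that dominate the cloud.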
# Word cloud of the 100 most frequent terms
# (the Set2 palette provides at most 8 colours)
wordcloud(Corpus, max.words = 100, scale = c(8, 1),
          colors = brewer.pal(8, "Set2"), random.color = TRUE)
# Same pipeline for papers that contain the term Gender
Papers <- arxiv_search(query = '"Gender"', limit = 200)
Corpus <- with(Papers, VCorpus(VectorSource(abstract)))
Corpus <- Corpus %>%
  tm_map(stripWhitespace) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("english"))
# The Set1 palette provides at most 9 colours
wordcloud(Corpus, max.words = 100, scale = c(8, 1),
          colors = brewer.pal(9, "Set1"), random.color = TRUE)
# Same pipeline for papers that contain the term Data Science
Papers <- arxiv_search(query = '"Data Science"', limit = 200)
Corpus <- with(Papers, VCorpus(VectorSource(abstract)))
Corpus <- Corpus %>%
  tm_map(stripWhitespace) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("english"))
# The Set3 palette provides at most 12 colours
wordcloud(Corpus, max.words = 100, scale = c(8, 1),
          colors = brewer.pal(12, "Set3"), random.color = TRUE)
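The three searches above repeat exactly the same cleaning steps, so they could be wrapped in a function. The following is only a sketch: `clean_abstracts()` is a hypothetical helper introduced here for illustration, not a function of tm or aRxiv, and the two input strings are made up.

```r
library(tm)

# Hypothetical helper: builds a corpus from a character vector of abstracts
# and applies the same cleaning steps used above, in the same order.
clean_abstracts <- function(abstracts) {
  corpus <- VCorpus(VectorSource(abstracts))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus
}

# Made-up input, used only to show the call
cleaned <- clean_abstracts(c("The Gender gap in 2020!",
                             "Data science, and stereotypes."))
as.character(cleaned[[1]])
```

With such a helper, each search would reduce to two calls: `arxiv_search()` to fetch the papers and `wordcloud(clean_abstracts(Papers$abstract), ...)` to draw the cloud.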