This paper is my attempt to find if there any influence of chosen seed at the word distribution. As before, the en_US locale is used.
Load the whole data:
Now seeds, I used a method of typing random numbers by keyboard, so they are true-random, not pseudo.
construct_seed <- function(seed) {
set.seed(seed = seed)
blogs <- corpus_sample(blogs, size = 5000)
news <- corpus_sample(news, size = 5000)
twitter <- corpus_sample(twitter, size = 5000)
docnames(blogs) <- c(1:5000)
docnames(news) <- c(5001:10000)
docnames(twitter) <- c(10001:15000)
corp <- blogs + news + twitter
tokens <- tokens(corp)
#remove numbers and punctuation
tokens <- tokens(tokens, remove_punct = T, remove_numbers = T)
#remove stopwords
tokens <- tokens_select(tokens, pattern = stopwords('en'), selection = 'remove')
#Make matrix
matrx <- dfm(tokens)
matrx <- as.matrix(matrx)
words <- sort(colSums(matrx), decreasing = TRUE)
df <- data.frame(word = names(words), freq = words)
wc <- wordcloud(words = df$word, freq = df$freq,
min.freq = 1,
max.words = 300,
random.order = FALSE,
rot.per = 0.35,
colors = brewer.pal(8, 'Dark2'))
arg_name <- deparse(substitute(seed)) # Get argument name
var_name <- paste("wc", arg_name, sep="_") # Construct the name
assign(var_name, wc, env=.GlobalEnv) # Assign values to variable
# variable will be created in .GlobalEnv
}
I will compare the output of my function with different seeds using visual method (word clouds).
construct_seed(seed = 362)
construct_seed(seed = 24734)
construct_seed(seed = 346)
construct_seed(seed = 12354)
Looks like there’s no significant difference, so any random seed can be used in constructing the final model.