Recently I have been using R for some basic data visualisations, producing outputs like word clouds and heat maps. I don’t have a programming background, so at first glance R's command-line environment seemed a little daunting. However, the ease with which I have been able to create some pretty amazing outputs with very little code has surprised me. In this blog post I will share the steps of a simple process, along with the small amount of code that is needed.
# Install required packages (uncomment and run once if they are not installed yet)
#install.packages(c("tm", "wordcloud", "SnowballC"))
# Load libraries
library(tm)
## Loading required package: NLP
library(wordcloud)
## Loading required package: RColorBrewer
library(SnowballC)
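Create a new folder, e.g. ~/Desktop/test/, containing a speech.txt file with the text you want to analyse. The path passed to DirSource() in the next step is from a Windows machine, so change it to wherever your own folder lives.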
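If you do not have a text file to hand, the folder and file can also be created from R itself. The following is a minimal sketch using only base R; the folder location (inside R's temporary directory) and the placeholder sentences are my own assumptions, not part of the original steps.

```r
# Assumed location: a "test" folder inside R's temporary directory
test_dir <- file.path(tempdir(), "test")
dir.create(test_dir, showWarnings = FALSE, recursive = TRUE)

# Placeholder text; replace it with the speech you want to analyse
sample_text <- c("This is a placeholder sentence.",
                 "Replace these lines with your own text.")
writeLines(sample_text, file.path(test_dir, "speech.txt"))

# Confirm the file is there; test_dir can then be passed to DirSource()
list.files(test_dir)
```

Using tempdir() keeps the example from touching your Desktop, but the folder disappears when the R session ends, so use a permanent path for real work.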
# Create a corpus variable
mooncloud <- Corpus(DirSource("C:/Users/112-user/Desktop/test/"))
# Make sure it has loaded properly - have a look!
#inspect(mooncloud)
(Replace start)
# Strip unnecessary whitespace
mooncloud <- tm_map(mooncloud, stripWhitespace)
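# Convert to lowercase (wrap base R functions in content_transformer so the corpus structure is kept)
mooncloud <- tm_map(mooncloud, content_transformer(tolower))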
# Remove common stopwords (conjunctions, articles, etc.)
mooncloud <- tm_map(mooncloud, removeWords, stopwords("english"))
# Reduce words to their common 'stem' by removing suffixes
mooncloud <- tm_map(mooncloud, stemDocument)
# Remove commas etc.
mooncloud <- tm_map(mooncloud, removePunctuation)
# (Optional) In older versions of 'tm', some transformations could convert the documents to something other than text.
# If wordcloud() fails for that reason, uncomment this line to convert them back:
#mooncloud <- tm_map(mooncloud, PlainTextDocument)
(Replace end)
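To see roughly what these cleaning steps do, here is a sketch in base R applied to a single sentence. It only imitates the tm transformations (it is not how tm implements them); the sentence and the three-word stopword list are invented for illustration, and stemming is left out because it needs SnowballC.

```r
text <- "The   Rockets, and the engines, are READY!"  # invented example sentence

text <- gsub("\\s+", " ", text)            # collapse runs of whitespace
text <- tolower(text)                      # convert to lowercase
text <- gsub("[[:punct:]]", "", text)      # remove punctuation
words <- strsplit(text, " ")[[1]]          # split into individual words
small_stopwords <- c("the", "and", "are")  # hypothetical, very short stopword list
words <- words[!words %in% small_stopwords]
cleaned <- paste(words, collapse = " ")
cleaned
# "rockets engines ready"
```

Only the words that carry meaning survive, which is what makes the resulting word cloud readable.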
# Time to generate a wordcloud!
wordcloud(mooncloud
, scale=c(5,0.5) # Size range of the words (largest, smallest)
, max.words=100 # Set top n words
, random.order=FALSE # Words in decreasing freq
, rot.per=0.35 # Proportion of words rotated 90 degrees
, use.r.layout=FALSE # Use C++ collision detection
, colors=brewer.pal(8, "Dark2"))
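wordcloud() sizes each word by how often it occurs, so it can be useful to look at the counts behind the picture. The following is a minimal base R sketch, assuming the cleaned text is available as one string; the string used here is invented for illustration.

```r
cleaned <- "moon rocket moon space rocket moon"  # invented example text

words <- strsplit(cleaned, " ")[[1]]            # split into individual words
freq <- sort(table(words), decreasing = TRUE)   # count each word, most frequent first
head(freq, 10)                                  # show up to the ten most frequent words
```

The same counts can also be passed to wordcloud() explicitly through its words and freq arguments, e.g. names(freq) and as.vector(freq). Keep in mind that its min.freq argument hides words that occur fewer times than the threshold.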
In addition: if you want to analyse texts in Azerbaijani, replace the code above between "(Replace start)" and "(Replace end)" with the following script.
az_stopwords <- c("bir", "və", "ki", "bu", "ilə", "üçün", "da","də","öz","ancaq","hər") # Add more Azerbaijani stopwords
# Define a custom function to remove Azerbaijani stopwords
removeAzStopwords <- function(text) {
  text <- tolower(text) # Convert text to lowercase for case insensitivity
  # Handle each element separately so that several documents or lines are not merged into one
  vapply(strsplit(text, "\\s+"), function(words) { # Tokenize on any whitespace
    words <- words[!words %in% az_stopwords] # Remove Azerbaijani stopwords
    paste(words, collapse = " ") # Recreate the cleaned text
  }, character(1), USE.NAMES = FALSE)
}
# Apply the custom function to your Corpus
mooncloud <- tm_map(mooncloud, content_transformer(removeAzStopwords))
Source: https://lukesingham.com/how-to-make-a-word-cloud-using-r/