Loading the Data

In this task, the goal is to understand the basic relationships between the words in the data. To that end, figures will be built showing the most frequent words and word pairs. The data to be analyzed was obtained from the Coursera website.
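
The analysis below relies on a few R packages for text mining, tokenization and plotting; it is assumed they were already installed and loaded, for example:

library(tm)           # Corpus and cleaning transformations (tm_map)
library(RWeka)        # NGramTokenizer / Weka_control
library(wordcloud)    # wordcloud()
library(wordcloud2)   # wordcloud2()
library(RColorBrewer) # brewer.pal color palettes
library(ggplot2)      # bar graphs
library(magrittr)     # the %>% pipe operator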

There are three datasets: ‘en_US.twitter.txt’, with data collected from Twitter; ‘en_US.blogs.txt’, with data collected from blogs; and ‘en_US.news.txt’, with data collected from news sites. Loading these datasets into the workspace:

twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8")
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8")
news <- readLines("en_US.news.txt", encoding = "UTF-8")
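
As a quick sanity check, the number of lines read from each file can be inspected (the exact counts depend on the downloaded files):

length(twitter); length(blogs); length(news)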

To speed up processing, a sample of 5,000 lines from each dataset was drawn and combined in the variable teste:

set.seed(2020)
twitter.s <- twitter[sample(1:length(twitter),5000)]
blogs.s <- blogs[sample(1:length(blogs),5000)]
news.s <- news[sample(1:length(news),5000)]
teste <- c(twitter.s, blogs.s, news.s)
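
Since 5,000 lines were taken from each of the three sources, teste should now contain 15,000 lines, which can be confirmed with:

length(teste)  # expected: 15000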

Cleaning the Data

Let’s start the data cleaning process by removing curly quotation marks and commas:

teste <- gsub('“', '', teste)
teste <- gsub("”", "", teste)
teste <- gsub(",", "", teste)

Now the data will be converted to a ‘Corpus’, that is, a collection of documents containing the natural language text. Then punctuation, extra spaces and numbers will be removed. Using text stemming, it is possible to collapse derivations of the same word into a common root. It is also desirable to remove common stopwords (like ‘and’, ‘no’, ‘that’, ‘with’, …) and convert all words to lower case:

#corpus
teste.corpus <- Corpus(VectorSource(teste))

#cleaning data: lower case, removing stopwords, removing punctuation and numbers, stripping spaces 
#and stemming to collapse derivations of the same word
teste.corpus <- teste.corpus %>% tm_map(content_transformer(tolower)) %>% 
  tm_map(removeWords, words = c(stopwords("english"), "s", "ve")) %>% 
  tm_map(removePunctuation) %>% tm_map(removeNumbers) %>% 
  tm_map(stripWhitespace) %>% tm_map(stemDocument)
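
To verify the cleaning, a few documents of the corpus can be inspected (the exact content depends on the sample):

#look at the first two cleaned documents
inspect(teste.corpus[1:2])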

Exploratory Data Analysis

In this section we are going to identify the most frequent words and groups of words in the data.

The N-grams are going to be created in this step. First, the corpus will be converted into a data frame; then the function NGramTokenizer will be used to generate 1-, 2-, 3- and 4-grams:

#transforming the corpus data into a data frame
teste.df <- data.frame(text = sapply(teste.corpus, as.character), stringsAsFactors = FALSE)
#creating the N-grams
uniteste <- NGramTokenizer(teste.df$text, Weka_control(min = 1, max = 1))
biteste <- NGramTokenizer(teste.df$text, Weka_control(min = 2, max = 2))
triteste <- NGramTokenizer(teste.df$text, Weka_control(min = 3, max = 3))
quadriteste <- NGramTokenizer(teste.df$text, Weka_control(min = 4, max = 4))
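
As a quick check of what the tokenizer produces, the first tokens of the 1- and 2-gram vectors can be printed (the output depends on the sample):

head(uniteste)
head(biteste)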

Determining the frequencies for each N-gram:

#frequency 1-gram
unifreq.df <- data.frame(table(uniteste))
unifreq.df <- unifreq.df[order(unifreq.df$Freq, decreasing = TRUE),]
colnames(unifreq.df) <- c("word", "freq")
#frequency 2-gram
bifreq.df <- data.frame(table(biteste))
bifreq.df <- bifreq.df[order(bifreq.df$Freq, decreasing = TRUE),]
colnames(bifreq.df) <- c("words", "freq")
#frequency 3-gram
trifreq.df <- data.frame(table(triteste))
trifreq.df <- trifreq.df[order(trifreq.df$Freq, decreasing = TRUE),]
colnames(trifreq.df) <- c("words", "freq")
#frequency 4-gram
quadrifreq.df <- data.frame(table(quadriteste))
quadrifreq.df <- quadrifreq.df[order(quadrifreq.df$Freq, decreasing = TRUE),]
colnames(quadrifreq.df) <- c("words", "freq")
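
Before plotting, the top entries of the frequency tables already give a first idea of the most common terms (again, the exact words depend on the sample):

head(unifreq.df, 5)
head(bifreq.df, 5)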

For the N-grams created, we are going to visualize the frequencies using a wordcloud and a bar graph. In the wordcloud representation, the size of a word is directly related to its frequency. Starting with the 1-grams:

#1-gram wordcloud
set.seed(2020)
wordcloud2(data = unifreq.df, size = 0.5, shape = "pentagon")

Plotting the 10 most frequent words in a bar graph:

ggplot(head(unifreq.df,10), aes(reorder(word,freq), freq)) +
  geom_bar(stat = "identity", fill = rainbow(10), col = rainbow(10)) + coord_flip() +
  xlab("Unigrams") + ylab("Frequency") + ggtitle("Most Frequent Unigrams")

For the 2-grams:

#2-gram wordcloud
set.seed(2020)
wordcloud(words = bifreq.df$words, freq = bifreq.df$freq, max.words = 30,
          random.order = TRUE, colors = brewer.pal(8, "Dark2"))

Plotting the 10 most frequent bigrams in a bar plot:

ggplot(head(bifreq.df,10), aes(reorder(words,freq), freq)) +
  geom_bar(stat = "identity", fill = rainbow(10), col = rainbow(10)) + coord_flip() +
  xlab("Bigrams") + ylab("Frequency") + ggtitle("Most Frequent Bigrams")

For the 3-grams:

#3-gram wordcloud
set.seed(2020)
wordcloud(words = trifreq.df$words, freq = trifreq.df$freq, max.words = 30,
          random.order = TRUE, colors = brewer.pal(8, "Dark2"))

Plotting the 10 most frequent trigrams in a bar plot:

ggplot(head(trifreq.df,10), aes(reorder(words,freq), freq)) +
  geom_bar(stat = "identity", fill = rainbow(10), col = rainbow(10)) + coord_flip() +
  xlab("Trigrams") + ylab("Frequency") + ggtitle("Most Frequent Trigrams")

For the 4-grams:

#4-gram wordcloud
set.seed(2020)
wordcloud(words = quadrifreq.df$words, freq = quadrifreq.df$freq, max.words = 15,
          random.order = TRUE, colors = brewer.pal(8, "Dark2"))

A bar plot with the 10 most frequent quadrigrams:

ggplot(head(quadrifreq.df,10), aes(reorder(words,freq), freq)) +
  geom_bar(stat = "identity", fill = rainbow(10), col = rainbow(10)) + coord_flip() +
  xlab("Quadrigrams") + ylab("Frequency") + ggtitle("Most Frequent Quadrigrams")

Further Actions

As a further step, a prediction model will be created and integrated into a Shiny app for word prediction.

Until now, the corpus was converted into N-grams and then into data frames of frequencies. This should be useful for predicting the next word in a sequence of words. For example, when looking at a string of three words, the most likely next word can be guessed by inspecting all 4-grams that start with those three words and choosing the most frequent one.
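
As a rough sketch of this idea (not the final model), a hypothetical lookup function could search the quadrigram frequency table built above for a three-word prefix; the prefix would have to be cleaned and stemmed the same way as the corpus:

#hypothetical helper: guess the next word from a three-word prefix
#using the quadrigram frequency table (already sorted by decreasing frequency)
predict_next <- function(prefix3, ngram.df = quadrifreq.df) {
  #keep quadrigrams whose first three words match the prefix
  matches <- ngram.df[grepl(paste0("^", prefix3, " "), ngram.df$words), ]
  if (nrow(matches) == 0) return(NA_character_)
  #the table is sorted by frequency, so the first match wins; return its last word
  top <- as.character(matches$words[1])
  tail(strsplit(top, " ")[[1]], 1)
}

predict_next("happi new year")  #prefix must be cleaned/stemmed like the corpus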

The plan for the Shiny application is to create a simple app where the user can enter a string of text, and the prediction model will return a list of suggestions for the next word.