Introduction

The final goal is to develop a predictive model for text. The first step to achieve this, is to analyze the distributions and relationships between the words, tokens and phrases in a text. To this end, an exploratory data analysis will be performed on English corpora coming from blogs, news or twitter. This will give us a better understanding of the word and word pairs frequencies. Questions such as word (unigrams) and word pairs (bigrams, trigrams) frequencies, words per line depending on the document type, as well as how to clean text from unwished words (special characters, profanity words) will be adressed in this first Milestone Report.

Load the data

To begin with, the needed data is downloaded and unzipped in a dedicated folder. The raw data contains several folders, one folder for each language. For the project, only the english data will be used (final/en_US). The english folder has 3 different text files, en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt. Each file represents a text corpus that originates either from blogs, news or twitter messages. A connexion is created for each text file, and the corpora are saved in the following variables by reading each line: corpus_blog, corpus_news, corpus_twitter.

data_folder_name <- "Coursera-SwiftKey.zip"

if (!file.exists(data_folder_name)) {
      data_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"   
      download.file(data_url, data_folder_name, method="curl")
      unzip(data_folder_name)
}

# Create connexions to the english files
connexion_blog <- file("final/en_US/en_US.blogs.txt", "r")
connexion_twitter <- file("final/en_US/en_US.twitter.txt", "r")
connexion_news <- file("final/en_US/en_US.news.txt", "r")

# Read the lines of the 3 files
corpus_blog <- readLines(connexion_blog, encoding = "UTF-8", skipNul = TRUE)
corpus_twitter <- readLines(connexion_twitter, encoding = "UTF-8", skipNul = TRUE)
corpus_news <- readLines(connexion_news, encoding = "UTF-8", skipNul = TRUE)

# Close the connexions
close(connexion_blog)
close(connexion_twitter)
close(connexion_news)

Exploratory Data Analysis

A first small exploratory data analysis of the whole english data is performed below. More precisely, we are interested in the number of lines, number of words, number of characters, statistical characteristics (mean, max, min…) of the number of words per line for each corpus. Those variables are calculated in the custom function StringAnalysis and are saved in the dataframe analysis_summary_df. Those characteristics are displayed in the table below.

#CountWords <- function(text) {
#      text <- trimws(text)
#      words <- unlist(strsplit(text, "\\s+"))
#      words <- words[nzchar(words)]  # Filter out empty strings
#      return(length(words))
#}

StringAnalysis <- function(text) {
      
      nLines <- length(text)
      #nWords <- CountWords(text)
      nWords <- sum(stri_count_words(text))
      nChars <- sum(nchar(text))
      
      WordsperLine <- stri_count_words(text)
      WordsperLine_summary <- summary(WordsperLine)
      
      result = c(nLines, nWords, nChars, 
                 as.numeric(WordsperLine_summary["Min."]),
                 as.numeric(WordsperLine_summary["Max."]),
                 as.numeric(WordsperLine_summary["Mean"]),
                 as.numeric(WordsperLine_summary["1st Qu."]),
                 as.numeric(WordsperLine_summary["3rd Qu."]))
      
      names(result) <- c("nLines", "nWords", "nChars", 
                         "WordspLines_Min", "WordspLines_Max", 
                         "WordspLines_Mean", "WordspLines_1stQ", 
                         "WordspLines_3rdQ")
      
      return(result)
}

#analysis_summary1 <- cbind(StringAnalysis(corpus_blog), StringAnalysis(corpus_news), StringAnalysis(corpus_twitter))
analysis_summary <- sapply(list(corpus_blog, corpus_news, corpus_twitter), StringAnalysis)
colnames(analysis_summary) <- c("corpus_blog", "corpus_news", "corpus_twitter")
analysis_summary_df <- as.data.frame(analysis_summary, stringsAsFactors = FALSE)

print(analysis_summary)
##                   corpus_blog  corpus_news corpus_twitter
## nLines           8.992880e+05 1.010242e+06   2.360148e+06
## nWords           3.754625e+07 3.476240e+07   3.009341e+07
## nChars           2.068245e+08 2.032232e+08   1.620962e+08
## WordspLines_Min  0.000000e+00 1.000000e+00   1.000000e+00
## WordspLines_Max  6.726000e+03 1.796000e+03   4.700000e+01
## WordspLines_Mean 4.175109e+01 3.440997e+01   1.275065e+01
## WordspLines_1stQ 9.000000e+00 1.900000e+01   7.000000e+00
## WordspLines_3rdQ 6.000000e+01 4.600000e+01   1.800000e+01

In addition, we also want to have a better look at the distribution of the number of words for each line, per type of corpus. The number of words for each line is saved in the variable Wordsperline using the function stri_count_words. The 3 distributions are then plotted as histograms as seen below: the light blue distribution for the blog corpus, the light green for the news corpus and light coral for the twitter corpus.

WordsperLine <- sapply(list(corpus_blog, corpus_news, corpus_twitter), stri_count_words)

# Create histograms showcasing the words per line frequency for each type of corpus
par(mfrow = c(1,3))

hist(WordsperLine[[1]], 
     main = "Blog", 
     xlab = "Words per Line", 
     ylab = "Frequency", 
     col = "lightblue", 
     border = "black",
     breaks = 40, 
     probability = TRUE)

hist(WordsperLine[[2]], 
     main = "News", 
     xlab = "Words per Line", 
     ylab = "Frequency", 
     col = "lightgreen", 
     border = "black",
     breaks = 40, 
     probability = TRUE)

hist(WordsperLine[[3]], 
     main = "Twitter", 
     xlab = "Words per Line", 
     ylab = "Frequency", 
     col = "lightcoral", 
     border = "black",
     breaks = 40, 
     probability = TRUE)

As one could expect, the twitter distribution is very narrow due to the small word limit allowed for each tweet. The maximum number of words per line is equal to 47, as can be seen in the dataframe summary. The peak of the distribution is situated at approx. 10 words per line. On the other hand, the number of words per line is higher for texts and blogs. The mean for each is respectively 34 and 42, still making the distributions narrow and close to the origin. But the maximum of words per line is much higher than for the twitter corpus, respectively 1796 and 6726.

To have a better visual representation of the distribution, the histograms were replotted below by taking the logarithmic value of the words per line variable.

WordsperLine <- sapply(list(corpus_blog, corpus_news, corpus_twitter), stri_count_words)

# Create histograms showcasing the words per line frequency for each type of corpus
par(mfrow = c(1,3))

hist(log(WordsperLine[[1]]), 
     main = "Blog", 
     xlab = "Log of the Words per Line", 
     ylab = "Frequency", 
     col = "lightblue", 
     border = "black",
     breaks = 40, 
     probability = TRUE)

hist(log(WordsperLine[[2]]), 
     main = "News", 
     xlab = "Log of the Words per Line", 
     ylab = "Frequency", 
     col = "lightgreen", 
     border = "black",
     breaks = 40, 
     probability = TRUE)

hist(log(WordsperLine[[3]]), 
     main = "Twitter", 
     xlab = "Log of the Words per Line", 
     ylab = "Frequency", 
     col = "lightcoral", 
     border = "black",
     breaks = 40, 
     probability = TRUE)

Sampling and cleaning

For the rest of this work, only a small sample of the corpus will be used. This will make the processing of the data a lot faster, while still using enough data to achieve the desired result. The sample size is chosen to be 5% of the mean of the total number of lines of each corpus type, divided by 3. A sample of the chosen size is then taken for each type and saved in sample_blog, sample_news, sample_twitter. This way, each sample will have the same number of sentences.

The 3 samples are combined in sample_corpus. Non-ASCII characters are removed (i.e. accented letters, special symbols, non-Latin characters) from the corpus before saving it as a txt file under en_US.sample_corpus.txt.

# Sampling the Data
set.seed(48)

sample_size = mean(analysis_summary[1,])/3 * 0.05

sample_blog <- sample(corpus_blog, sample_size, replace = FALSE)
sample_news <- sample(corpus_news, sample_size, replace = FALSE)
sample_twitter <- sample(corpus_twitter, sample_size, replace = FALSE)

# Combine them
sample_corpus <- c(sample_blog, sample_news, sample_twitter)

# Remove non-ASCII characters (accented letters, special symbols, non-Latin 
# alphabets or characters)
sample_corpus <- iconv(sample_corpus, "latin1", "ASCII", sub = "")

# Save the sample in a file
sample_file_name <- "en_US.sample_corpus.txt"
connexion_writing <- file(sample_file_name, open = "w")
writeLines(sample_corpus, connexion_writing)
close(connexion_writing)

The next step is to clean the text data. For example, profanity words can be present in the sample corpus, which is unwanted. To filter them out, a profanity list of prohibited words profanity_list is created combining lexicons that are already implemented in R (profanity_alvarez, profanity_banned, profanity_arr_bad) as well as a list of bad words created by https://www.cs.cmu.edu and saved at the following link https://www.cs.cmu.edu/~biglou/resources/bad-words.txt.

# Cleaning the sample corpus
profanity_list <- unique(c(lexicon::profanity_alvarez, 
                           lexicon::profanity_banned, 
                           lexicon::profanity_arr_bad))

url_badwords <- "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
profanity_list <- unique(c(profanity_list, readLines(url_badwords))) 

Further cleaning steps are implemented in a custom function called CleaningCorpus, which takes a large character variable as input. The applied preprocessing steps are as follows:

CleaningCorpus <- function(text) {
      text_corpus <- corpus(text)
      
      text_tokens <- tokens(text_corpus,
                            what="word1",
                            remove_punct = TRUE,
                            remove_numbers = TRUE,
                            remove_separators = TRUE,
                            remove_symbols = TRUE,
                            remove_url =TRUE)
      
      text_tokens_nostopw <- tokens_remove(text_tokens, pattern = stopwords("en"))
      
      
      return(text_tokens)
} 

sample_tokens <- CleaningCorpus(sample_corpus)
sample_tokens_clean <- tokens_remove(sample_tokens, pattern = profanity_list)
sample_tokens_nsw <- tokens_remove(sample_tokens_clean, pattern = stopwords("en"))

Unigrams

Now that the data was put into the form of tokens, it can be analyzed with more ease. One important aspect that we need to inspect in this work, is to see how different words relate to each other, i.e. what kind of word would follow knowing the word that precedes. A good way to study this is to look at the n-grams. An n-gram represents a sequence of n adjacent tokens. In this report, unigrams, bigrams and trigrams will be studied in more details, i.e. respectively n-grams of size 1, 2 and 3 tokens.

For this purpose, the quanteda library is used, which is a R package that provides a fast and flexible framework for quantitative text analysis. In particular, the data can be put into the form of a document-feature matrix, or dfm matrix. In this format, rows indicate documents and columns “features”, where a feature is a token.

We start by analyzing the unigrams in more detail.

unigrams <- tokens_ngrams(sample_tokens, n = 1, concatenator = " ")
unigrams_nsw <- tokens_ngrams(sample_tokens_nsw, n = 1, concatenator = " ")

matrix_unigrams <- dfm(unigrams, 
                    tolower = TRUE,
                    remove_padding = TRUE)

matrix_unigrams_nsw <- dfm(unigrams_nsw, 
                      tolower = TRUE,
                      remove_padding = TRUE)

It is now easy to extract from the matrix the top 10 most present unigrams (or words) in the corpus, which is here done for both the matrix with or without stop words. Unsurprisingly, when stop words are present in the analysis, the top 10 are only made of stop words. Excluding them, the most common words are respectively said, one, just, can and like. The top 100 most common unigrams are saved in a dataframe (with or without stop words), which will later be used to plot some frequency histograms.

Another characteristic that can be investigated is the number of unigrams needed to make up 50% and 90% of the word frequency. With stop words, this number is equal to 142 for 50% and 8335 for 90%. Without stop words, this number increases to 1110 for 50% and 17669 for 90%. This shows us that stop words do make up a very large part of the corpus when they are included. Another interesting observation is that the number of words to make up a certain frequency is not proportional, i.e. the number of words needed to make up 50% of the corpus is 16 times smaller than for 90% (without stop words case).

print(topfeatures(matrix_unigrams, 10))
##    the     to    and      a     of     in      i   that    for     is 
## 103873  57081  53853  50265  44689  35271  31202  22509  21698  21523
print(topfeatures(matrix_unigrams_nsw, 10))
##   said    one   just    can   like   time    get    new    now people 
##   6993   6032   5463   4931   4903   4281   3970   3814   3296   3219
top100_unigrams_nsw <- topfeatures(matrix_unigrams_nsw, 100)
top100_unigrams_nsw_df <- data.frame(unigram = names(top100_unigrams_nsw), frequency = top100_unigrams_nsw)

top100_unigrams <- topfeatures(matrix_unigrams, 100)
top100_unigrams_df <- data.frame(unigram = names(top100_unigrams), frequency = top100_unigrams)

unigram_freqs <- colSums(matrix_unigrams)
sorted_unigram_freqs <- sort(unigram_freqs, decreasing = TRUE)
unigram_percentage <- sorted_unigram_freqs / sum(unigram_freqs) * 100
cumulative_percentage <- cumsum(unigram_percentage)
num_unigrams_50 <- which(cumulative_percentage >= 50)[1]
num_unigrams_90 <- which(cumulative_percentage >= 90)[1]

cat("Number of unigrams needed to make up 50% of the word frequency:", num_unigrams_50, "\n")
## Number of unigrams needed to make up 50% of the word frequency: 142
cat("Number of unigrams needed to make up 90% of the word frequency:", num_unigrams_90, "\n")
## Number of unigrams needed to make up 90% of the word frequency: 8335
unigram_freqs_nsw <- colSums(matrix_unigrams_nsw)
sorted_unigram_freqs_nsw <- sort(unigram_freqs_nsw, decreasing = TRUE)
unigram_percentage_nsw <- sorted_unigram_freqs_nsw / sum(unigram_freqs_nsw) * 100
cumulative_percentage_nsw <- cumsum(unigram_percentage_nsw)
num_unigrams_50_nsw <- which(cumulative_percentage_nsw >= 50)[1]
num_unigrams_90_nsw <- which(cumulative_percentage_nsw >= 90)[1]

cat("Number of unigrams needed to make up 50% of the word frequency (no stop words):", num_unigrams_50_nsw, "\n")
## Number of unigrams needed to make up 50% of the word frequency (no stop words): 1110
cat("Number of unigrams needed to make up 90% of the word frequency (no stop words):", num_unigrams_90_nsw, "\n")
## Number of unigrams needed to make up 90% of the word frequency (no stop words): 17669

Bigrams

A similar analysis is performed for bigrams (2 adjacent tokens). The most common bigrams are of the (9782), in the (8903), to the (4622), on the (4196) and for the (3843) when including stop words. Without stop words, the most common bigrams are new york (470), last year (406), right now (356), high school (320), and years ago (312). The top 100 are saved in their own dataframe top100_bigrams_df (with stop words) and top100_bigrams_nsw_df (without stop words).

bigrams <- tokens_ngrams(sample_tokens, n = 2, concatenator = " ")
bigrams_nsw <- tokens_ngrams(sample_tokens_nsw, n = 2, concatenator = " ")

matrix_bigrams <- dfm(bigrams, 
                      tolower = TRUE,
                      remove_padding = TRUE)

matrix_bigrams_nsw <- dfm(bigrams_nsw, 
                          tolower = TRUE,
                          remove_padding = TRUE)

print(topfeatures(matrix_bigrams, 10))
##   of the   in the   to the   on the  for the    to be   at the  and the 
##     9782     8903     4622     4196     3843     3260     2921     2819 
##     in a with the 
##     2641     2364
print(topfeatures(matrix_bigrams_nsw, 10))
##    new york   last year   right now high school   years ago   last week 
##         470         406         356         320         312         268 
##  first time even though    st louis   feel like 
##         248         218         217         214
top100_bigrams_nsw <- topfeatures(matrix_bigrams_nsw, 100)
top100_bigrams_nsw_df <- data.frame(bigram = names(top100_bigrams_nsw), frequency = top100_bigrams_nsw)

top100_bigrams <- topfeatures(matrix_bigrams, 100)
top100_bigrams_df <- data.frame(bigram = names(top100_bigrams), frequency = top100_bigrams)

Trigams

The code is repeated a third time for trigrams (3 adjacent tokens). In this case, the most common trigrams are one of the (803), a lot of (610), it was a (351), to be a (331), out of the (316) when including stop words, and new york city (56), two years ago (38), new york times (38), president barack obama (34) and let us know (32). Again, the top 100 are saved in the dataframe top100_trigrams_df and top100_trigrams_nsw_df.

trigrams <- tokens_ngrams(sample_tokens, n = 3, concatenator = " ")
trigrams_nsw <- tokens_ngrams(sample_tokens_nsw, n = 3, concatenator = " ")

matrix_trigrams <- dfm(trigrams, 
                      tolower = TRUE,
                      remove_padding = TRUE)

matrix_trigrams_nsw <- dfm(trigrams_nsw, 
                          tolower = TRUE,
                          remove_padding = TRUE)

print(topfeatures(matrix_trigrams, 10))
##  one of the    a lot of    it was a     to be a  out of the  the end of 
##         803         610         351         331         316         310 
##   i want to some of the  as well as part of the 
##         301         299         295         284
print(topfeatures(matrix_trigrams_nsw, 10))
##          new york city          two years ago         new york times 
##                     56                     38                     38 
## president barack obama            let us know          two weeks ago 
##                     34                     32                     29 
##           world war ii         happy new year       first time since 
##                     25                     25                     22 
##        st louis county 
##                     22
top100_trigrams_nsw <- topfeatures(matrix_trigrams_nsw, 100)
top100_trigrams_nsw_df <- data.frame(trigram = names(top100_trigrams_nsw), frequency = top100_trigrams_nsw)

top100_trigrams <- topfeatures(matrix_trigrams, 100)
top100_trigrams_df <- data.frame(trigram = names(top100_trigrams), frequency = top100_trigrams)

Plots

The results obtained previously are summarized in the next plots. Each graphic shows the top 20 n-grams on 3 different histograms : the blue frequency histogram for unigrams, green for bigrams and orange for trigrams. A version of the graphic exists with and without stop words.

As one already saw previously, when stop words are included, the most common n-grams are almost only made of them. Their frequency is also higher than the frequency of non stop words. On the other hand, analysis without stop words is a lot more interesting, since the variety of the n-grams is higher, and gives context to the text.

plot_unigrams_nsw <- ggplot(head(top100_unigrams_nsw_df, 20), aes(x = reorder(unigram, -frequency), y = frequency)) +
      geom_bar(stat = "identity", fill = "lightblue") +
      labs(title = "Top 20 Unigrams (without stop words)", x = "Unigram", y = "Frequency") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))

plot_bigrams_nsw <- ggplot(head(top100_bigrams_nsw_df, 20), aes(x = reorder(bigram, -frequency), y = frequency)) +
      geom_bar(stat = "identity", fill = "lightgreen") +
      labs(title = "Top 20 Bigrams (without stop words)", x = "Bigram", y = "Frequency") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))

plot_trigrams_nsw <- ggplot(head(top100_trigrams_nsw_df, 20), aes(x = reorder(trigram, -frequency), y = frequency)) +
      geom_bar(stat = "identity", fill = "coral") +
      labs(title = "Top 20 Trigrams (without stop words)", x = "Trigram", y = "Frequency") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))

grid.arrange(plot_unigrams_nsw, plot_bigrams_nsw, plot_trigrams_nsw, ncol = 1)

plot_unigrams <- ggplot(head(top100_unigrams_df, 20), aes(x = reorder(unigram, -frequency), y = frequency)) +
      geom_bar(stat = "identity", fill = "lightblue") +
      labs(title = "Top 20 Unigrams (with stop words)", x = "Unigram", y = "Frequency") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))

plot_bigrams <- ggplot(head(top100_bigrams_df, 20), aes(x = reorder(bigram, -frequency), y = frequency)) +
      geom_bar(stat = "identity", fill = "lightgreen") +
      labs(title = "Top 20 Bigrams (with stop words)", x = "Bigram", y = "Frequency") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))

plot_trigrams <- ggplot(head(top100_trigrams_df, 20), aes(x = reorder(trigram, -frequency), y = frequency)) +
      geom_bar(stat = "identity", fill = "coral") +
      labs(title = "Top 20 Trigrams (with stop words)", x = "Trigram", y = "Frequency") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))

grid.arrange(plot_unigrams, plot_bigrams, plot_trigrams, ncol = 1)

Conclusion

The Exploratory Data Analysis done in this report gave us interesting insight in characteristics such as the number of words per line depending on the type of document (blog, news or twitter messages). For example, we learned that the number of words per line is very small in the case of twitter message (below 40) compared to blog and news documents. The corpora were also sampled and cleaned, and relationships between words were investigated using the quanteda library and transforming the strings into n-grams. We saw the importance of filtering out stop words to get relevant information about the context of the text, and we studied some of the top most common uni-, bi-, and trigrams. For the creation of a Shiny App that will allow a user to enter a word and give a predictive suggestion for the next possible word, bigrams and trigrams (and maybe even further n-grams) seem to be of utmost importance, since those give us an information of possible future word choices depending on the already written words. As a next step, one could expand on further n-grams, taking into account stop words because of their prevalence, and give them different weights.