The final goal is to develop a predictive model for text. The first step to achieve this, is to analyze the distributions and relationships between the words, tokens and phrases in a text. To this end, an exploratory data analysis will be performed on English corpora coming from blogs, news or twitter. This will give us a better understanding of the word and word pairs frequencies. Questions such as word (unigrams) and word pairs (bigrams, trigrams) frequencies, words per line depending on the document type, as well as how to clean text from unwished words (special characters, profanity words) will be adressed in this first Milestone Report.
To begin with, the needed data is downloaded and unzipped in a
dedicated folder. The raw data contains several folders, one folder for
each language. For the project, only the english data will be used
(final/en_US). The english folder has 3 different text
files, en_US.blogs.txt, en_US.news.txt and
en_US.twitter.txt. Each file represents a text corpus that
originates either from blogs, news or twitter messages. A connexion is
created for each text file, and the corpora are saved in the following
variables by reading each line: corpus_blog,
corpus_news, corpus_twitter.
data_folder_name <- "Coursera-SwiftKey.zip"
if (!file.exists(data_folder_name)) {
data_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(data_url, data_folder_name, method="curl")
unzip(data_folder_name)
}
# Create connexions to the english files
connexion_blog <- file("final/en_US/en_US.blogs.txt", "r")
connexion_twitter <- file("final/en_US/en_US.twitter.txt", "r")
connexion_news <- file("final/en_US/en_US.news.txt", "r")
# Read the lines of the 3 files
corpus_blog <- readLines(connexion_blog, encoding = "UTF-8", skipNul = TRUE)
corpus_twitter <- readLines(connexion_twitter, encoding = "UTF-8", skipNul = TRUE)
corpus_news <- readLines(connexion_news, encoding = "UTF-8", skipNul = TRUE)
# Close the connexions
close(connexion_blog)
close(connexion_twitter)
close(connexion_news)
A first small exploratory data analysis of the whole english data is
performed below. More precisely, we are interested in the number of
lines, number of words, number of characters, statistical
characteristics (mean, max, min…) of the number of words per line for
each corpus. Those variables are calculated in the custom function
StringAnalysis and are saved in the dataframe
analysis_summary_df. Those characteristics are displayed in
the table below.
#CountWords <- function(text) {
# text <- trimws(text)
# words <- unlist(strsplit(text, "\\s+"))
# words <- words[nzchar(words)] # Filter out empty strings
# return(length(words))
#}
StringAnalysis <- function(text) {
nLines <- length(text)
#nWords <- CountWords(text)
nWords <- sum(stri_count_words(text))
nChars <- sum(nchar(text))
WordsperLine <- stri_count_words(text)
WordsperLine_summary <- summary(WordsperLine)
result = c(nLines, nWords, nChars,
as.numeric(WordsperLine_summary["Min."]),
as.numeric(WordsperLine_summary["Max."]),
as.numeric(WordsperLine_summary["Mean"]),
as.numeric(WordsperLine_summary["1st Qu."]),
as.numeric(WordsperLine_summary["3rd Qu."]))
names(result) <- c("nLines", "nWords", "nChars",
"WordspLines_Min", "WordspLines_Max",
"WordspLines_Mean", "WordspLines_1stQ",
"WordspLines_3rdQ")
return(result)
}
#analysis_summary1 <- cbind(StringAnalysis(corpus_blog), StringAnalysis(corpus_news), StringAnalysis(corpus_twitter))
analysis_summary <- sapply(list(corpus_blog, corpus_news, corpus_twitter), StringAnalysis)
colnames(analysis_summary) <- c("corpus_blog", "corpus_news", "corpus_twitter")
analysis_summary_df <- as.data.frame(analysis_summary, stringsAsFactors = FALSE)
print(analysis_summary)
## corpus_blog corpus_news corpus_twitter
## nLines 8.992880e+05 1.010242e+06 2.360148e+06
## nWords 3.754625e+07 3.476240e+07 3.009341e+07
## nChars 2.068245e+08 2.032232e+08 1.620962e+08
## WordspLines_Min 0.000000e+00 1.000000e+00 1.000000e+00
## WordspLines_Max 6.726000e+03 1.796000e+03 4.700000e+01
## WordspLines_Mean 4.175109e+01 3.440997e+01 1.275065e+01
## WordspLines_1stQ 9.000000e+00 1.900000e+01 7.000000e+00
## WordspLines_3rdQ 6.000000e+01 4.600000e+01 1.800000e+01
In addition, we also want to have a better look at the distribution
of the number of words for each line, per type of corpus. The number of
words for each line is saved in the variable Wordsperline
using the function stri_count_words. The 3 distributions
are then plotted as histograms as seen below: the light blue
distribution for the blog corpus, the light green for the news corpus
and light coral for the twitter corpus.
WordsperLine <- sapply(list(corpus_blog, corpus_news, corpus_twitter), stri_count_words)
# Create histograms showcasing the words per line frequency for each type of corpus
par(mfrow = c(1,3))
hist(WordsperLine[[1]],
main = "Blog",
xlab = "Words per Line",
ylab = "Frequency",
col = "lightblue",
border = "black",
breaks = 40,
probability = TRUE)
hist(WordsperLine[[2]],
main = "News",
xlab = "Words per Line",
ylab = "Frequency",
col = "lightgreen",
border = "black",
breaks = 40,
probability = TRUE)
hist(WordsperLine[[3]],
main = "Twitter",
xlab = "Words per Line",
ylab = "Frequency",
col = "lightcoral",
border = "black",
breaks = 40,
probability = TRUE)
As one could expect, the twitter distribution is very narrow due to the small word limit allowed for each tweet. The maximum number of words per line is equal to 47, as can be seen in the dataframe summary. The peak of the distribution is situated at approx. 10 words per line. On the other hand, the number of words per line is higher for texts and blogs. The mean for each is respectively 34 and 42, still making the distributions narrow and close to the origin. But the maximum of words per line is much higher than for the twitter corpus, respectively 1796 and 6726.
To have a better visual representation of the distribution, the histograms were replotted below by taking the logarithmic value of the words per line variable.
WordsperLine <- sapply(list(corpus_blog, corpus_news, corpus_twitter), stri_count_words)
# Create histograms showcasing the words per line frequency for each type of corpus
par(mfrow = c(1,3))
hist(log(WordsperLine[[1]]),
main = "Blog",
xlab = "Log of the Words per Line",
ylab = "Frequency",
col = "lightblue",
border = "black",
breaks = 40,
probability = TRUE)
hist(log(WordsperLine[[2]]),
main = "News",
xlab = "Log of the Words per Line",
ylab = "Frequency",
col = "lightgreen",
border = "black",
breaks = 40,
probability = TRUE)
hist(log(WordsperLine[[3]]),
main = "Twitter",
xlab = "Log of the Words per Line",
ylab = "Frequency",
col = "lightcoral",
border = "black",
breaks = 40,
probability = TRUE)
For the rest of this work, only a small sample of the corpus will be
used. This will make the processing of the data a lot faster, while
still using enough data to achieve the desired result. The sample size
is chosen to be 5% of the mean of the total number of lines of each
corpus type, divided by 3. A sample of the chosen size is then taken for
each type and saved in sample_blog,
sample_news, sample_twitter. This way, each
sample will have the same number of sentences.
The 3 samples are combined in sample_corpus. Non-ASCII
characters are removed (i.e. accented letters, special symbols,
non-Latin characters) from the corpus before saving it as a
txt file under en_US.sample_corpus.txt.
# Sampling the Data
set.seed(48)
sample_size = mean(analysis_summary[1,])/3 * 0.05
sample_blog <- sample(corpus_blog, sample_size, replace = FALSE)
sample_news <- sample(corpus_news, sample_size, replace = FALSE)
sample_twitter <- sample(corpus_twitter, sample_size, replace = FALSE)
# Combine them
sample_corpus <- c(sample_blog, sample_news, sample_twitter)
# Remove non-ASCII characters (accented letters, special symbols, non-Latin
# alphabets or characters)
sample_corpus <- iconv(sample_corpus, "latin1", "ASCII", sub = "")
# Save the sample in a file
sample_file_name <- "en_US.sample_corpus.txt"
connexion_writing <- file(sample_file_name, open = "w")
writeLines(sample_corpus, connexion_writing)
close(connexion_writing)
The next step is to clean the text data. For example, profanity words
can be present in the sample corpus, which is unwanted. To filter them
out, a profanity list of prohibited words profanity_list is
created combining lexicons that are already implemented in R
(profanity_alvarez, profanity_banned,
profanity_arr_bad) as well as a list of bad words created
by https://www.cs.cmu.edu and saved at the following link
https://www.cs.cmu.edu/~biglou/resources/bad-words.txt.
# Cleaning the sample corpus
profanity_list <- unique(c(lexicon::profanity_alvarez,
lexicon::profanity_banned,
lexicon::profanity_arr_bad))
url_badwords <- "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
profanity_list <- unique(c(profanity_list, readLines(url_badwords)))
Further cleaning steps are implemented in a custom function called
CleaningCorpus, which takes a large character variable as
input. The applied preprocessing steps are as follows:
text_tokens with the
tokens function. At this step, the unwanted punctuations,
numbers, separators, symbols and url links are eliminated.CleaningCorpus is called upon
sample_corpus, giving tokens_corpus as
output.tokens_remove function as well as the
profanity_list created previously.text_tokens_nsw is introduced.CleaningCorpus <- function(text) {
text_corpus <- corpus(text)
text_tokens <- tokens(text_corpus,
what="word1",
remove_punct = TRUE,
remove_numbers = TRUE,
remove_separators = TRUE,
remove_symbols = TRUE,
remove_url =TRUE)
text_tokens_nostopw <- tokens_remove(text_tokens, pattern = stopwords("en"))
return(text_tokens)
}
sample_tokens <- CleaningCorpus(sample_corpus)
sample_tokens_clean <- tokens_remove(sample_tokens, pattern = profanity_list)
sample_tokens_nsw <- tokens_remove(sample_tokens_clean, pattern = stopwords("en"))
Now that the data was put into the form of tokens, it can be analyzed with more ease. One important aspect that we need to inspect in this work, is to see how different words relate to each other, i.e. what kind of word would follow knowing the word that precedes. A good way to study this is to look at the n-grams. An n-gram represents a sequence of n adjacent tokens. In this report, unigrams, bigrams and trigrams will be studied in more details, i.e. respectively n-grams of size 1, 2 and 3 tokens.
For this purpose, the quanteda library is used, which is
a R package that provides a fast and flexible framework for quantitative
text analysis. In particular, the data can be put into the form of a
document-feature matrix, or dfm matrix. In this format, rows indicate
documents and columns “features”, where a feature is a token.
We start by analyzing the unigrams in more detail.
unigrams <- tokens_ngrams(sample_tokens, n = 1, concatenator = " ")
unigrams_nsw <- tokens_ngrams(sample_tokens_nsw, n = 1, concatenator = " ")
matrix_unigrams <- dfm(unigrams,
tolower = TRUE,
remove_padding = TRUE)
matrix_unigrams_nsw <- dfm(unigrams_nsw,
tolower = TRUE,
remove_padding = TRUE)
It is now easy to extract from the matrix the top 10 most present
unigrams (or words) in the corpus, which is here done for both the
matrix with or without stop words. Unsurprisingly, when stop words are
present in the analysis, the top 10 are only made of stop words.
Excluding them, the most common words are respectively
said, one, just, can
and like. The top 100 most common unigrams are saved in a
dataframe (with or without stop words), which will later be used to plot
some frequency histograms.
Another characteristic that can be investigated is the number of unigrams needed to make up 50% and 90% of the word frequency. With stop words, this number is equal to 142 for 50% and 8335 for 90%. Without stop words, this number increases to 1110 for 50% and 17669 for 90%. This shows us that stop words do make up a very large part of the corpus when they are included. Another interesting observation is that the number of words to make up a certain frequency is not proportional, i.e. the number of words needed to make up 50% of the corpus is 16 times smaller than for 90% (without stop words case).
print(topfeatures(matrix_unigrams, 10))
## the to and a of in i that for is
## 103873 57081 53853 50265 44689 35271 31202 22509 21698 21523
print(topfeatures(matrix_unigrams_nsw, 10))
## said one just can like time get new now people
## 6993 6032 5463 4931 4903 4281 3970 3814 3296 3219
top100_unigrams_nsw <- topfeatures(matrix_unigrams_nsw, 100)
top100_unigrams_nsw_df <- data.frame(unigram = names(top100_unigrams_nsw), frequency = top100_unigrams_nsw)
top100_unigrams <- topfeatures(matrix_unigrams, 100)
top100_unigrams_df <- data.frame(unigram = names(top100_unigrams), frequency = top100_unigrams)
unigram_freqs <- colSums(matrix_unigrams)
sorted_unigram_freqs <- sort(unigram_freqs, decreasing = TRUE)
unigram_percentage <- sorted_unigram_freqs / sum(unigram_freqs) * 100
cumulative_percentage <- cumsum(unigram_percentage)
num_unigrams_50 <- which(cumulative_percentage >= 50)[1]
num_unigrams_90 <- which(cumulative_percentage >= 90)[1]
cat("Number of unigrams needed to make up 50% of the word frequency:", num_unigrams_50, "\n")
## Number of unigrams needed to make up 50% of the word frequency: 142
cat("Number of unigrams needed to make up 90% of the word frequency:", num_unigrams_90, "\n")
## Number of unigrams needed to make up 90% of the word frequency: 8335
unigram_freqs_nsw <- colSums(matrix_unigrams_nsw)
sorted_unigram_freqs_nsw <- sort(unigram_freqs_nsw, decreasing = TRUE)
unigram_percentage_nsw <- sorted_unigram_freqs_nsw / sum(unigram_freqs_nsw) * 100
cumulative_percentage_nsw <- cumsum(unigram_percentage_nsw)
num_unigrams_50_nsw <- which(cumulative_percentage_nsw >= 50)[1]
num_unigrams_90_nsw <- which(cumulative_percentage_nsw >= 90)[1]
cat("Number of unigrams needed to make up 50% of the word frequency (no stop words):", num_unigrams_50_nsw, "\n")
## Number of unigrams needed to make up 50% of the word frequency (no stop words): 1110
cat("Number of unigrams needed to make up 90% of the word frequency (no stop words):", num_unigrams_90_nsw, "\n")
## Number of unigrams needed to make up 90% of the word frequency (no stop words): 17669
A similar analysis is performed for bigrams (2 adjacent tokens). The
most common bigrams are of the (9782), in the
(8903), to the (4622), on the (4196) and
for the (3843) when including stop words. Without stop
words, the most common bigrams are new york (470),
last year (406), right now (356),
high school (320), and years ago (312). The
top 100 are saved in their own dataframe top100_bigrams_df
(with stop words) and top100_bigrams_nsw_df (without stop
words).
bigrams <- tokens_ngrams(sample_tokens, n = 2, concatenator = " ")
bigrams_nsw <- tokens_ngrams(sample_tokens_nsw, n = 2, concatenator = " ")
matrix_bigrams <- dfm(bigrams,
tolower = TRUE,
remove_padding = TRUE)
matrix_bigrams_nsw <- dfm(bigrams_nsw,
tolower = TRUE,
remove_padding = TRUE)
print(topfeatures(matrix_bigrams, 10))
## of the in the to the on the for the to be at the and the
## 9782 8903 4622 4196 3843 3260 2921 2819
## in a with the
## 2641 2364
print(topfeatures(matrix_bigrams_nsw, 10))
## new york last year right now high school years ago last week
## 470 406 356 320 312 268
## first time even though st louis feel like
## 248 218 217 214
top100_bigrams_nsw <- topfeatures(matrix_bigrams_nsw, 100)
top100_bigrams_nsw_df <- data.frame(bigram = names(top100_bigrams_nsw), frequency = top100_bigrams_nsw)
top100_bigrams <- topfeatures(matrix_bigrams, 100)
top100_bigrams_df <- data.frame(bigram = names(top100_bigrams), frequency = top100_bigrams)
The code is repeated a third time for trigrams (3 adjacent tokens).
In this case, the most common trigrams are one of the
(803), a lot of (610), it was a (351),
to be a (331), out of the (316) when including
stop words, and new york city (56),
two years ago (38), new york times (38),
president barack obama (34) and let us know
(32). Again, the top 100 are saved in the dataframe
top100_trigrams_df and
top100_trigrams_nsw_df.
trigrams <- tokens_ngrams(sample_tokens, n = 3, concatenator = " ")
trigrams_nsw <- tokens_ngrams(sample_tokens_nsw, n = 3, concatenator = " ")
matrix_trigrams <- dfm(trigrams,
tolower = TRUE,
remove_padding = TRUE)
matrix_trigrams_nsw <- dfm(trigrams_nsw,
tolower = TRUE,
remove_padding = TRUE)
print(topfeatures(matrix_trigrams, 10))
## one of the a lot of it was a to be a out of the the end of
## 803 610 351 331 316 310
## i want to some of the as well as part of the
## 301 299 295 284
print(topfeatures(matrix_trigrams_nsw, 10))
## new york city two years ago new york times
## 56 38 38
## president barack obama let us know two weeks ago
## 34 32 29
## world war ii happy new year first time since
## 25 25 22
## st louis county
## 22
top100_trigrams_nsw <- topfeatures(matrix_trigrams_nsw, 100)
top100_trigrams_nsw_df <- data.frame(trigram = names(top100_trigrams_nsw), frequency = top100_trigrams_nsw)
top100_trigrams <- topfeatures(matrix_trigrams, 100)
top100_trigrams_df <- data.frame(trigram = names(top100_trigrams), frequency = top100_trigrams)
The results obtained previously are summarized in the next plots. Each graphic shows the top 20 n-grams on 3 different histograms : the blue frequency histogram for unigrams, green for bigrams and orange for trigrams. A version of the graphic exists with and without stop words.
As one already saw previously, when stop words are included, the most common n-grams are almost only made of them. Their frequency is also higher than the frequency of non stop words. On the other hand, analysis without stop words is a lot more interesting, since the variety of the n-grams is higher, and gives context to the text.
plot_unigrams_nsw <- ggplot(head(top100_unigrams_nsw_df, 20), aes(x = reorder(unigram, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "lightblue") +
labs(title = "Top 20 Unigrams (without stop words)", x = "Unigram", y = "Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
plot_bigrams_nsw <- ggplot(head(top100_bigrams_nsw_df, 20), aes(x = reorder(bigram, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "lightgreen") +
labs(title = "Top 20 Bigrams (without stop words)", x = "Bigram", y = "Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
plot_trigrams_nsw <- ggplot(head(top100_trigrams_nsw_df, 20), aes(x = reorder(trigram, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "coral") +
labs(title = "Top 20 Trigrams (without stop words)", x = "Trigram", y = "Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
grid.arrange(plot_unigrams_nsw, plot_bigrams_nsw, plot_trigrams_nsw, ncol = 1)
plot_unigrams <- ggplot(head(top100_unigrams_df, 20), aes(x = reorder(unigram, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "lightblue") +
labs(title = "Top 20 Unigrams (with stop words)", x = "Unigram", y = "Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
plot_bigrams <- ggplot(head(top100_bigrams_df, 20), aes(x = reorder(bigram, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "lightgreen") +
labs(title = "Top 20 Bigrams (with stop words)", x = "Bigram", y = "Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
plot_trigrams <- ggplot(head(top100_trigrams_df, 20), aes(x = reorder(trigram, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "coral") +
labs(title = "Top 20 Trigrams (with stop words)", x = "Trigram", y = "Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
grid.arrange(plot_unigrams, plot_bigrams, plot_trigrams, ncol = 1)
The Exploratory Data Analysis done in this report gave us interesting insight in characteristics such as the number of words per line depending on the type of document (blog, news or twitter messages). For example, we learned that the number of words per line is very small in the case of twitter message (below 40) compared to blog and news documents. The corpora were also sampled and cleaned, and relationships between words were investigated using the quanteda library and transforming the strings into n-grams. We saw the importance of filtering out stop words to get relevant information about the context of the text, and we studied some of the top most common uni-, bi-, and trigrams. For the creation of a Shiny App that will allow a user to enter a word and give a predictive suggestion for the next possible word, bigrams and trigrams (and maybe even further n-grams) seem to be of utmost importance, since those give us an information of possible future word choices depending on the already written words. As a next step, one could expand on further n-grams, taking into account stop words because of their prevalence, and give them different weights.