The objective of the Capstone project is to develop a data application that reliably predicts the next word in a sentence. The application will be trained on text data collected from Twitter, news feeds and blogs. The first step towards this objective is to explore the data provided for the modelling. The results of this data exploration are described in this paper.
Data loading starts with loading the required libraries, setting the working directory, setting the seed for reproducible results, and reading the three data files.
library(ggplot2)
library(NLP)
library(tm)
library(qdap)
library(ngram)
library(dplyr)
library(tidytext)
library(tidyr)
setwd("~/Datasciencecoursera/Module 10 Capstone project/Data/en_US")
set.seed(54322)
blogs <- readLines("en_US.blogs.txt")
news <- readLines("en_US.news.txt")
twitter <- readLines("en_US.twitter.txt")
Now we check the characteristics of the data sets.
length(news)
## [1] 77259
length(blogs)
## [1] 899288
length(twitter)
## [1] 2360148
object.size(blogs)
## 260564320 bytes
object.size(news)
## 20111392 bytes
object.size(twitter)
## 316037344 bytes
So news has 77,259 lines, blogs 899,288 lines and twitter 2,360,148 lines. Given the object sizes above, and because the calculation of n-grams is memory-intensive, we choose to sample both the blogs and twitter text lines. But first we do some more cleaning of the input data and further data exploration. Casual manual inspection showed that some lines contain non-ASCII characters. With the processing below, the lines containing non-ASCII characters are removed from the text.
knitr::opts_chunk$set(cache = TRUE, echo = TRUE)
# keep only the lines that contain no non-ASCII characters
news <- news[!grepl("this_is_not_ascii",
                    iconv(news, "latin1", "ASCII", sub = "this_is_not_ascii"))]
blogs <- blogs[!grepl("this_is_not_ascii",
                      iconv(blogs, "latin1", "ASCII", sub = "this_is_not_ascii"))]
twitter <- twitter[!grepl("this_is_not_ascii",
                          iconv(twitter, "latin1", "ASCII", sub = "this_is_not_ascii"))]
print(paste("Number of lines in news text: ", length(news)))
## [1] "Number of lines in news text: 66780"
print(paste("Number of lines in blogs text: ",length(blogs)))
## [1] "Number of lines in blogs text: 636261"
print(paste("Number of lines in twitter text: ", length(twitter)))
## [1] "Number of lines in twitter text: 2282717"
After removal of the lines containing non-ASCII characters, news, blogs and twitter have 66,780, 636,261 and 2,282,717 lines respectively. Comparison with the original counts shows that about 14% of the news lines, 29% of the blog lines and 3% of the twitter lines contained non-ASCII characters.
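For reference, these shares follow directly from the line counts printed before and after cleaning (a small sketch using those counts as literals, since the original vectors have already been overwritten):
# percentage of lines removed because they contained non-ASCII characters
round((77259 - 66780) / 77259 * 100)        # news:    14
round((899288 - 636261) / 899288 * 100)     # blogs:   29
round((2360148 - 2282717) / 2360148 * 100)  # twitter: 3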
Now we can count the characters and words in the lines of the data sets, starting with the characters. Via the nchar() function we can easily generate basic statistics and a histogram per data set.
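A minimal sketch of how these statistics and histograms can be produced for each data set:
# distribution of the number of characters per line
news_chars    <- nchar(news)
blogs_chars   <- nchar(blogs)
twitter_chars <- nchar(twitter)
summary(news_chars)
summary(blogs_chars)
summary(twitter_chars)
hist(news_chars,    main = "Characters per line (news)",    xlab = "characters")
hist(blogs_chars,   main = "Characters per line (blogs)",   xlab = "characters")
hist(twitter_chars, main = "Characters per line (twitter)", xlab = "characters")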
The news and blog lines generally stay below 1,000 characters, but a few lines run up to 6,000-9,000 characters. The 140-character limit of twitter messages is clearly visible. For the news and blog texts, the influence of these very long lines can be investigated further in the modelling phase.
Now we can count the word frequencies (on the basis of the unigrams) as well as the bigrams and trigrams, after first sampling the blogs and twitter text lines. Sampling 10% of the lines still gives a reliable view (corresponding to better than a 99% confidence level with a 0.5% margin of error). Stop words are removed afterwards, using the stop_words lexicon from the tidytext package.
twitter_sample <- sample(twitter, length(twitter)/10)
blogs_sample <- sample(blogs, length(blogs)/10)
total_text <- c(twitter_sample, blogs_sample, news)
# some basic cleaning
total_text <- removeNumbers(total_text)
total_text <- tolower(total_text)
knitr::opts_chunk$set(cache = TRUE, echo = TRUE)
unigrams_df <- as.data.frame(total_text) %>%
  unnest_tokens(ngram, total_text, token = "ngrams", n = 1) %>%
  count(ngram, sort = TRUE)
bigrams_df <- as.data.frame(total_text) %>%
  unnest_tokens(ngram, total_text, token = "ngrams", n = 2) %>%
  count(ngram, sort = TRUE)
print(unigrams_df, n = 10)
## # A tibble: 147,207 x 2
## ngram n
## <chr> <int>
## 1 the 321405
## 2 to 194748
## 3 a 167302
## 4 and 163130
## 5 of 132453
## 6 i 127544
## 7 in 113480
## 8 for 81929
## 9 is 77171
## 10 you 76470
## # ... with 1.472e+05 more rows
print(bigrams_df, n = 10)
## # A tibble: 2,192,017 x 2
## ngram n
## <chr> <int>
## 1 in the 27952
## 2 of the 27462
## 3 for the 15111
## 4 to the 14335
## 5 on the 13816
## 6 to be 11001
## 7 at the 10103
## 8 in a 8153
## 9 and the 8144
## 10 with the 6994
## # ... with 2.192e+06 more rows
# now filter the n-grams
unigrams_filtered_df <- unigrams_df %>% filter(!ngram %in% stop_words$word)
bigrams_separated <- bigrams_df %>%
  separate(ngram, c("word1", "word2"), sep = " ")
bigrams_filtered_df <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
# trigrams
trigrams_filtered_df <- as.data.frame(total_text) %>%
  unnest_tokens(trigram, total_text, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word) %>%
  count(word1, word2, word3, sort = TRUE)
print(unigrams_filtered_df, n = 10)
## # A tibble: 146,487 x 2
## ngram n
## <chr> <int>
## 1 time 16540
## 2 day 14439
## 3 love 13532
## 4 people 10979
## 5 rt 8445
## 6 lol 6907
## 7 night 6358
## 8 home 6297
## 9 life 6247
## 10 week 6238
## # ... with 146,477 more rows
print(bigrams_filtered_df, n = 10)
## # A tibble: 850,133 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 happy birthday 872
## 2 st louis 739
## 3 los angeles 515
## 4 san francisco 482
## 5 social media 442
## 6 san diego 420
## 7 health care 363
## 8 ice cream 342
## 9 mother's day 326
## 10 stay tuned 323
## # ... with 850,123 more rows
print(trigrams_filtered_df, n = 10)
## # A tibble: 459,923 x 4
## word1 word2 word3 n
## <chr> <chr> <chr> <int>
## 1 happy mothers day 189
## 2 happy mother's day 139
## 3 cinco de mayo 107
## 4 president barack obama 86
## 5 st louis county 75
## 6 world war ii 62
## 7 st patrick's day 52
## 8 love love love 47
## 9 martin luther king 42
## 10 ha ha ha 38
## # ... with 459,913 more rows
Without the removal of stop words, one can clearly see that the most frequent unigrams and bigrams are dominated by stop words. Given the low information value of these words for predicting the next word, removing the stop words is necessary to make the prediction model work. The filtered unigrams, bigrams and trigrams shown above form a better basis for the prediction model.
Now we can check how many unique words are needed to cover 50% of all word occurrences in the corpus (and similarly for the bigrams and trigrams), which gives an indication of how large a dictionary the prediction model needs.
knitr::opts_chunk$set(cache = TRUE, echo = TRUE)
total_words_in_corpus <- sum(unigrams_filtered_df[,2])
total_bigrams_in_corpus <- sum(bigrams_filtered_df[,3])
total_trigrams_in_corpus <- sum(trigrams_filtered_df[,4])
for (i in 1:nrow(unigrams_filtered_df)) {
  percentage_covered <- (sum(unigrams_filtered_df[1:i, 2]) / total_words_in_corpus) * 100
  number_of_words_needed_for_50_percent <- i
  if (round(percentage_covered) >= 50) { break }
}
number_of_words_needed_for_50_percent
## [1] 1514
# Percentage of words needed for 50% dictionary coverage
(number_of_words_needed_for_50_percent/nrow(unigrams_filtered_df))*100
## [1] 1.033539
for (i in 1:nrow(bigrams_filtered_df)) {
  percentage_covered <- (sum(bigrams_filtered_df[1:i, 3]) / total_bigrams_in_corpus) * 100
  number_of_bigrams_needed_for_50_percent <- i
  if (round(percentage_covered) >= 50) { break }
}
number_of_bigrams_needed_for_50_percent
## [1] 268023
# Percentage of bigrams needed for 50% dictionary coverage
(number_of_bigrams_needed_for_50_percent/nrow(bigrams_filtered_df))*100
## [1] 31.52718
for (i in 1:nrow(trigrams_filtered_df)) {
  percentage_covered <- (sum(trigrams_filtered_df[1:i, 4]) / total_trigrams_in_corpus) * 100
  number_of_trigrams_needed_for_50_percent <- i
  if (round(percentage_covered) >= 50) { break }
}
number_of_trigrams_needed_for_50_percent
## [1] 205718
# Percentage of trigrams needed for 50% dictionary coverage
(number_of_trigrams_needed_for_50_percent/nrow(trigrams_filtered_df))*100
## [1] 44.72879
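As a design note, the loops above recompute the running total at every iteration; a vectorized sketch with cumsum(), shown here for the unigrams only and mirroring the rounding used in the loop, gives the same count:
# cumulative percentage of word occurrences covered by the top-ranked words
coverage_pct <- 100 * cumsum(unigrams_filtered_df$n) / total_words_in_corpus
which(round(coverage_pct) >= 50)[1]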
As can be seen from the output, for the unigrams only about 1% of the most frequent words already covers 50% of all word occurrences in the corpus. For the bigrams that percentage rises steeply (32%) and it rises even further for the trigrams (45%). For the development of the model this means that efficiencies can be achieved with relatively few unique words, but much less so with the bigrams and trigrams. This can be explained by the frequency distribution of the bigrams and trigrams: the top bigrams and trigrams occur relatively much less often than the top words, while the number of distinct bigrams and trigrams is much larger than the number of distinct words. This gives a much flatter probability distribution, so that many more rows are needed for adequate coverage within the prediction model.
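One way to make this flatter distribution visible is to look at the share of distinct n-grams that occur only once in the sample (a small sketch, output not shown here):
# proportion of distinct n-grams that appear exactly once (hapax legomena)
mean(unigrams_filtered_df$n == 1)
mean(bigrams_filtered_df$n == 1)
mean(trigrams_filtered_df$n == 1)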
The identification of foreign words in a source text is a complex task, due to the frequent exchange of words between languages and the existence of dialects within a language. As a practical proxy, the share of lines containing non-ASCII characters may be indicative of non-English text. By that reasoning, 29% of the blog lines and 14% of the news lines would be non-English, while twitter messages would be nearly all English; the latter may be misleading, however, and more a consequence of twitter messages tending to contain ASCII characters only.
The best estimate would come from checking the words against an English dictionary, but I have not yet found a suitable one for R.
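Should a suitable word list become available, for instance via the hunspell package and its built-in en_US dictionary (an option that has not been evaluated for this report), such a check could look roughly like this:
library(hunspell)
# flag sampled twitter lines in which fewer than half of the words
# pass the en_US spell check (punctuation is stripped first)
words_per_line <- strsplit(gsub("[[:punct:]]", "", twitter_sample), "\\s+")
english_share  <- sapply(words_per_line, function(w) mean(hunspell_check(w)))
possibly_non_english <- twitter_sample[english_share < 0.5]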
On the basis of this data exploration, processing time and memory usage are adequate when sampling the rather large twitter and blogs corpora. In the prediction model, the approach will be to use the cleaned data and calculate the probability of the upcoming word with a Markov-chain approach based on the unigram, bigram and trigram frequencies. We may even use 4-grams if memory allows. As can be seen from the top appearing n-grams, there are many more possible combinations of words than actually appear in the corpora. Smoothing techniques to estimate probabilities for word combinations that do not appear in the corpora may therefore be necessary; techniques such as Kneser-Ney smoothing may be applied after evaluation of the first test results.
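As an illustration of the intended Markov-chain lookup, the sketch below implements a naive backoff from the trigrams to the bigrams, without any smoothing yet; the function name and the use of the stop-word-filtered tables are choices made for this illustration only.
# predict the next word from the last two words typed:
# try the trigram table first, then back off to the bigram table
predict_next_word <- function(w1, w2,
                              trigrams = trigrams_filtered_df,
                              bigrams  = bigrams_filtered_df) {
  hit <- trigrams %>% filter(word1 == w1, word2 == w2) %>% arrange(desc(n))
  if (nrow(hit) > 0) return(hit$word3[1])
  hit <- bigrams %>% filter(word1 == w2) %>% arrange(desc(n))
  if (nrow(hit) > 0) return(hit$word2[1])
  NA_character_
}

predict_next_word("happy", "mothers")  # likely "day", given the trigram counts above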