The objective of the Capstone project is to develop a data application that reliably predicts the next word in a sentence. The application will be trained on text data collected from Twitter, news feeds and blogs. The first step towards this objective is to explore the data provided for the modelling. The results of this data exploration are described in this paper.
Data loading starts with loading the required libraries, setting the working directory, setting the seed for reproducible results, and reading the three data files.
library(ggplot2)
library(NLP)
library(tm)
library(qdap)
library(ngram)
library(dplyr)
library(tidytext)
library(tidyr)
setwd("~/Datasciencecoursera/Module 10 Capstone project/Data/en_US")
set.seed(54322)
blogs <- readLines("en_US.blogs.txt")
news <- readLines("en_US.news.txt")
twitter <- readLines("en_US.twitter.txt")
Now we check the characteristics of the data sets.
length(news)
## [1] 77259
length(blogs)
## [1] 899288
length(twitter)
## [1] 2360148
object.size(blogs)
## 260564320 bytes
object.size(news)
## 20111392 bytes
object.size(twitter)
## 316037344 bytes
So news has 77,259 lines, blogs 899,288 lines and twitter 2,360,148 lines. Given the object sizes above, and because the calculation of n-grams is memory-intensive, we choose to sample both the blogs and twitter text lines. But first we do some more cleaning of the input data and further data exploration. Casual manual inspection showed that some lines contain non-ASCII characters. With the processing below, the lines containing non-ASCII characters are removed from the text.
knitr::opts_chunk$set(cache = TRUE, echo = TRUE)
# keep only the lines that contain no non-ASCII characters
news <- news[!grepl("this_is_not_ascii",
                    iconv(news, "latin1", "ASCII", sub = "this_is_not_ascii"))]
blogs <- blogs[!grepl("this_is_not_ascii",
                      iconv(blogs, "latin1", "ASCII", sub = "this_is_not_ascii"))]
twitter <- twitter[!grepl("this_is_not_ascii",
                          iconv(twitter, "latin1", "ASCII", sub = "this_is_not_ascii"))]
print(paste("Number of lines in news text: ", length(news)))
## [1] "Number of lines in news text: 66780"
print(paste("Number of lines in blogs text: ",length(blogs)))
## [1] "Number of lines in blogs text: 636261"
print(paste("Number of lines in twitter text: ", length(twitter)))
## [1] "Number of lines in twitter text: 2282717"
After removal of the lines containing non-ASCII characters, news, blogs and twitter have 66,780, 636,261 and 2,282,717 lines respectively. Comparison with the original counts shows that about 14% of the news lines, 29% of the blog lines and 3% of the twitter lines contained non-ASCII characters.
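For reference, these shares follow directly from the line counts printed before and after cleaning (a small sketch using those counts as literals, since the original vectors have already been overwritten):
# percentage of lines removed because they contained non-ASCII characters
round((77259 - 66780) / 77259 * 100)        # news:    14
round((899288 - 636261) / 899288 * 100)     # blogs:   29
round((2360148 - 2282717) / 2360148 * 100)  # twitter: 3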
Now we can count the characters and words in the lines of the data sets, starting with the characters. Via the nchar() function we can easily generate basic statistics and a histogram per data set.
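A minimal sketch of how these statistics and histograms can be produced for each data set:
# distribution of the number of characters per line
news_chars    <- nchar(news)
blogs_chars   <- nchar(blogs)
twitter_chars <- nchar(twitter)
summary(news_chars)
summary(blogs_chars)
summary(twitter_chars)
hist(news_chars,    main = "Characters per line (news)",    xlab = "characters")
hist(blogs_chars,   main = "Characters per line (blogs)",   xlab = "characters")
hist(twitter_chars, main = "Characters per line (twitter)", xlab = "characters")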
The news and blog lines generally stay below 1,000 characters, but a few lines run up to 6,000-9,000 characters. The 140-character limit of twitter messages is clearly visible. For the news and blog texts, the influence of these very long lines can be investigated further in the modelling phase.
Now we can count the word frequencies (on the basis of the unigrams) as well as the bigrams and trigrams, after first sampling the blogs and twitter text lines. Sampling 10% of the lines still gives a reliable view (corresponding to better than a 99% confidence level with a 0.5% margin of error). Stop words are removed afterwards, using the stop_words lexicon from the tidytext package.
twitter_sample <- sample(twitter, length(twitter)/10)
blogs_sample <- sample(blogs, length(blogs)/10)
total_text <- c(twitter_sample, blogs_sample, news)
# some basic cleaning
total_text <- removeNumbers(total_text)
total_text <- tolower(total_text)
knitr::opts_chunk$set(cache = TRUE, echo = TRUE)
unigrams_df <- as.data.frame(total_text) %>%
  unnest_tokens(ngram, total_text, token = "ngrams", n = 1) %>%
  count(ngram, sort = TRUE)
bigrams_df <- as.data.frame(total_text) %>%
  unnest_tokens(ngram, total_text, token = "ngrams", n = 2) %>%
  count(ngram, sort = TRUE)
print(unigrams_df, n = 10)
## # A tibble: 147,207 x 2
## ngram n
## <chr> <int>
## 1 the 321405
## 2 to 194748
## 3 a 167302
## 4 and 163130
## 5 of 132453
## 6 i 127544
## 7 in 113480
## 8 for 81929
## 9 is 77171
## 10 you 76470
## # ... with 1.472e+05 more rows
print(bigrams_df, n = 10)
## # A tibble: 2,192,017 x 2
## ngram n
## <chr> <int>
## 1 in the 27952
## 2 of the 27462
## 3 for the 15111
## 4 to the 14335
## 5 on the 13816
## 6 to be 11001
## 7 at the 10103
## 8 in a 8153
## 9 and the 8144
## 10 with the 6994
## # ... with 2.192e+06 more rows
# now filter the n-grams
unigrams_filtered_df <- unigrams_df %>% filter(!ngram %in% stop_words$word)
bigrams_separated <- bigrams_df %>%
  separate(ngram, c("word1", "word2"), sep = " ")
bigrams_filtered_df <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
# trigrams
trigrams_filtered_df <- as.data.frame(total_text) %>%
  unnest_tokens(trigram, total_text, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word) %>%
  count(word1, word2, word3, sort = TRUE)
print(unigrams_filtered_df, n = 10)
## # A tibble: 146,487 x 2
## ngram n
## <chr> <int>
## 1 time 16540
## 2 day 14439
## 3 love 13532
## 4 people 10979
## 5 rt 8445
## 6 lol 6907
## 7 night 6358
## 8 home 6297
## 9 life 6247
## 10 week 6238
## # ... with 146,477 more rows
print(bigrams_filtered_df, n = 10)
## # A tibble: 850,133 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 happy birthday 872
## 2 st louis 739
## 3 los angeles 515
## 4 san francisco 482
## 5 social media 442
## 6 san diego 420
## 7 health care 363
## 8 ice cream 342
## 9 mother's day 326
## 10 stay tuned 323
## # ... with 850,123 more rows
print(trigrams_filtered_df, n = 10)
## # A tibble: 459,923 x 4
## word1 word2 word3 n
## <chr> <chr> <chr> <int>
## 1 happy mothers day 189
## 2 happy mother's day 139
## 3 cinco de mayo 107
## 4 president barack obama 86
## 5 st louis county 75
## 6 world war ii 62
## 7 st patrick's day 52
## 8 love love love 47
## 9 martin luther king 42
## 10 ha ha ha 38
## # ... with 459,913 more rows
Without the removal of stop words, one can clearly see that the most frequent unigrams and bigrams are dominated by stop words. Given the low information value of these words for predicting the next word, removing the stop words is necessary to make the prediction model work. The filtered unigrams, bigrams and trigrams shown above form a better basis for the prediction model.
Now we can check how many unique words are needed to cover 50% of all word occurrences in the corpus (and similarly for the bigrams and trigrams), which gives an indication of how large a dictionary the prediction model needs.
knitr::opts_chunk$set(cache = TRUE, echo = TRUE)
total_words_in_corpus <- sum(unigrams_filtered_df[,2])
total_bigrams_in_corpus <- sum(bigrams_filtered_df[,3])
total_trigrams_in_corpus <- sum(trigrams_filtered_df[,4])
for (i in 1:nrow(unigrams_filtered_df)) {
  percentage_covered <- (sum(unigrams_filtered_df[1:i, 2]) / total_words_in_corpus) * 100
  number_of_words_needed_for_50_percent <- i
  if (round(percentage_covered) >= 50) { break }
}
number_of_words_needed_for_50_percent
## [1] 1514
# Percentage of words needed for 50% dictionary coverage
(number_of_words_needed_for_50_percent/nrow(unigrams_filtered_df))*100
## [1] 1.033539
for (i in 1:nrow(bigrams_filtered_df)) {
  percentage_covered <- (sum(bigrams_filtered_df[1:i, 3]) / total_bigrams_in_corpus) * 100
  number_of_bigrams_needed_for_50_percent <- i
  if (round(percentage_covered) >= 50) { break }
}
number_of_bigrams_needed_for_50_percent
## [1] 268023
# Percentage of bigrams needed for 50% dictionary coverage
(number_of_bigrams_needed_for_50_percent/nrow(bigrams_filtered_df))*100
## [1] 31.52718
for (i in 1:nrow(trigrams_filtered_df)) {
  percentage_covered <- (sum(trigrams_filtered_df[1:i, 4]) / total_trigrams_in_corpus) * 100
  number_of_trigrams_needed_for_50_percent <- i
  if (round(percentage_covered) >= 50) { break }
}
number_of_trigrams_needed_for_50_percent
## [1] 205718
# Percentage of trigrams needed for 50% dictionary coverage
(number_of_trigrams_needed_for_50_percent/nrow(trigrams_filtered_df))*100
## [1] 44.72879
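As a design note, the loops above recompute the running total at every iteration; a vectorized sketch with cumsum(), shown here for the unigrams only and mirroring the rounding used in the loop, gives the same count:
# cumulative percentage of word occurrences covered by the top-ranked words
coverage_pct <- 100 * cumsum(unigrams_filtered_df$n) / total_words_in_corpus
which(round(coverage_pct) >= 50)[1]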
As can be seen from the output, for the unigrams only about 1% of the most frequent words already covers 50% of all word occurrences in the corpus. For the bigrams that percentage rises steeply (32%) and it rises even further for the trigrams (45%). For the development of the model this means that efficiencies can be achieved with relatively few unique words, but much less so with the bigrams and trigrams. This can be explained by the frequency distribution of the bigrams and trigrams: the top bigrams and trigrams occur relatively much less often than the top words, while the number of distinct bigrams and trigrams is much larger than the number of distinct words. This gives a much flatter probability distribution, so that many more rows are needed for adequate coverage within the prediction model.
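One way to make this flatter distribution visible is to look at the share of distinct n-grams that occur only once in the sample (a small sketch, output not shown here):
# proportion of distinct n-grams that appear exactly once (hapax legomena)
mean(unigrams_filtered_df$n == 1)
mean(bigrams_filtered_df$n == 1)
mean(trigrams_filtered_df$n == 1)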
The identification of foreign words in a source text is a complex task, due to the frequent exchange of words between languages and the existence of dialects within a language. As a practical proxy, the share of lines containing non-ASCII characters may be indicative of non-English text. By that reasoning, 29% of the blog lines and 14% of the news lines would be non-English, while twitter messages would be nearly all English; the latter may be misleading, however, and more a consequence of twitter messages tending to contain ASCII characters only.
The best estimate would come from checking the words against an English dictionary, but I have not yet found a suitable one for R.
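Should a suitable word list become available, for instance via the hunspell package and its built-in en_US dictionary (an option that has not been evaluated for this report), such a check could look roughly like this:
library(hunspell)
# flag sampled twitter lines in which fewer than half of the words
# pass the en_US spell check (punctuation is stripped first)
words_per_line <- strsplit(gsub("[[:punct:]]", "", twitter_sample), "\\s+")
english_share  <- sapply(words_per_line, function(w) mean(hunspell_check(w)))
possibly_non_english <- twitter_sample[english_share < 0.5]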
On the basis of this data exploration, processing time and memory usage are adequate when sampling the rather large twitter and blogs corpora. In the prediction model, the approach will be to use the cleaned data and calculate the probability of the upcoming word with a Markov-chain approach based on the unigram, bigram and trigram frequencies. We may even use 4-grams if memory allows. As can be seen from the top appearing n-grams, there are many more possible combinations of words than actually appear in the corpora. Smoothing techniques to estimate probabilities for word combinations that do not appear in the corpora may therefore be necessary; techniques such as Kneser-Ney smoothing may be applied after evaluation of the first test results.
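As an illustration of the intended Markov-chain lookup, the sketch below implements a naive backoff from the trigrams to the bigrams, without any smoothing yet; the function name and the use of the stop-word-filtered tables are choices made for this illustration only.
# predict the next word from the last two words typed:
# try the trigram table first, then back off to the bigram table
predict_next_word <- function(w1, w2,
                              trigrams = trigrams_filtered_df,
                              bigrams  = bigrams_filtered_df) {
  hit <- trigrams %>% filter(word1 == w1, word2 == w2) %>% arrange(desc(n))
  if (nrow(hit) > 0) return(hit$word3[1])
  hit <- bigrams %>% filter(word1 == w2) %>% arrange(desc(n))
  if (nrow(hit) > 0) return(hit$word2[1])
  NA_character_
}

predict_next_word("happy", "mothers")  # likely "day", given the trigram counts above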