This document is an exploratory analysis in preparation for developing a model that takes a piece of text as input and returns three candidates for the next word in the text. The goal of this analysis is to understand how text data can be modelled as a structured dataset and to find qualities of the text data that indicate how a model can be built to predict the next word.
From the exploratory analysis we find that we can predict the next word in a text by knowing what the previous words are. We can train a model on word-sequence frequencies, and when we need to find the next word we look at the previous word sequence and predict the most likely continuation. We also see that the amount of data in our training set is very large. We therefore need to think of a way to make the model less complex to gain speed.
Key findings:
Conclusions:
Natural Language Processing (NLP) has become a field that delivers value to businesses and markets all over the world. The benefits of NLP range from sentiment analysis and topic modeling to information extraction. Another application of NLP is predictive text editing. This paper revolves around the exploratory analysis of different text data with the objective of finding a direction for a model that predicts the following word while someone is typing. In this exploratory analysis we are interested in the following questions:
We have a database of text documents coming from Twitter feeds, news articles and blogs. We have these text documents in 4 languages: English, Russian, German and Finnish. Since the model we want to create is an English text editor, we will only analyse the English text documents.
# Make connection to textdocument files
connection_twitter <- file("final/en_US/en_US.twitter.txt")
connection_news <- file("final/en_US/en_US.news.txt")
connection_blogs <- file("final/en_US/en_US.blogs.txt")
#Read lines in text documents
linestwitter <- readLines(con = connection_twitter, skipNul = TRUE)
linesnews <- readLines(con = connection_news, skipNul = TRUE)
linesblogs <- readLines(con = connection_blogs, skipNul = TRUE)
#Close file connections
close(connection_twitter)
close(connection_news)
close(connection_blogs)
We will split the data into a training and a test dataset (70% goes to the training set and 30% to the test set). We will only look at the training dataset in the exploratory phase to avoid possible biases and to be able to evaluate a model correctly.
##Create train and test dataset
get_train <- function(x){
  # Draw a Bernoulli(0.7) indicator per line and keep the selected lines as training data.
  # The fixed seed makes get_test() reproduce the same draws, so the two sets are exact complements.
  set.seed(333)
  train <- x[as.logical(rbinom(length(x), 1, 0.7))]
  return(train)}
get_test <- function(x){
  set.seed(333)
  test <- x[!as.logical(rbinom(length(x), 1, 0.7))]
  return(test)}
# Get training set for the 3 types of text documents and store test sets for future use.
connection_twitter <- file("final/twitter_test.txt")
connection_blogs <- file("final/blogs_test.txt")
connection_news <- file("final/news_test.txt")
twitter_train <- get_train(linestwitter)
#writeLines(get_test(linestwitter), connection_twitter)
news_train <- get_train(linesnews)
#writeLines(get_test(linesnews), connection_news)
blogs_train <- get_train(linesblogs)
#writeLines(get_test(linesblogs), connection_blogs)
close(connection_twitter)
close(connection_blogs)
close(connection_news)
We take a first look at the data to see what type of cleaning we would like to perform.
Twitter feeds
head(twitter_train, 5)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [4] "Words from a complete stranger! Made my birthday even better :)"
## [5] "i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up...damn thing"
Findings: for Twitter feeds we see that the corpus contains a lot of words that are not spelled correctly and/or casual speech. This is as expected, as Twitter is used in a casual fashion to send out short messages.
Blogs
head(blogs_train, 5)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan â\200œgodsâ\200\235."
## [2] "We love you Mr. Brown."
## [3] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."
## [4] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"
## [5] "If I were a bear,"
Findings: for blogs we see more correct language and more complete sentences that are grammatically sound.
News
head(news_train, 5)
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [3] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [4] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"
## [5] "14915 Charlevoix, Detroit"
Findings: for news we also see correct language and more complete sentences that are grammatically sound.
We want to clean the data so that we can start the exploratory analysis. Considering we want to create a predictive text editor, we do not want to remove any words from the corpus. We do however want to break down the structure of the text so that we can analyse it properly. We will therefore clean the data under the following assumptions:
Based on these assumptions and the first inspection of the data, we will perform the following cleaning steps:
prep <- function(x){
  # Split each document into sentences, lowercasing and stripping punctuation...
  x <- unlist(tokenize_sentences(x,
                                 lowercase = TRUE,
                                 strip_punct = TRUE,
                                 simplify = TRUE))
  # ...then collapse repeated whitespace.
  x <- str_squish(x)
  return(x)}
twitter_train <- prep(twitter_train)
blogs_train <- prep(blogs_train)
news_train <- prep(news_train)
head(news_train)
## [1] "he wasnt home alone apparently"
## [2] "the st"
## [3] "louis plant had to close"
## [4] "it would die of old age"
## [5] "workers had been making cars there since the onset of mass automotive production in the 1920s"
## [6] "the alaimo group of mount holly was up for a contract last fall to evaluate and suggest improvements to trenton water works"
Analysis plan: to see how we can predict the following word in a sentence, we first want to know what the data looks like. We will therefore analyse the frequency of unigrams (individual words), bigrams (sequences of 2 words) and trigrams (sequences of 3 words) across the different sources we have (Twitter, blogs and news). We also want to see whether there are big differences in the frequency of unigrams, bigrams and trigrams between our sources.
We first want to create the unigrams and explore the word frequency in the three sources.
#Create unigram
create_unigram <- function(x){
  # Tokenize into words, count the occurrences of each word and
  # return the frequency table sorted in decreasing order.
  x <- tokenize_words(x)
  x <- table(unlist(x))
  x <- data.frame(x)
  x <- x[order(x$Freq, decreasing = TRUE),]
  return(x)}
twitter_ng1_freq <- create_unigram(twitter_train)
blogs_ng1_freq <- create_unigram(blogs_train)
news_ng1_freq <- create_unigram(news_train)
c(nrow(twitter_ng1_freq), nrow(blogs_ng1_freq), nrow(news_ng1_freq))
## [1] 394554 369040 83946
We see that the Twitter and blog text documents contain roughly the same number of unique words, approximately 370,000 to 400,000. The news documents have far fewer unique words, approximately 84,000.
#Visualization of word frequency
p1 <- ggplot(twitter_ng1_freq) +
geom_histogram(aes(x = Freq), bins = 100) +
xlim(0,50) +
labs(x = "Word frequency", y = "Count", title = "TWITTER")
p2 <- ggplot(blogs_ng1_freq) +
geom_histogram(aes(x = Freq), bins = 100) +
xlim(0,50) +
labs(x = "Word frequency", y = "Count", title = "BLOGS")
p3 <- ggplot(news_ng1_freq) +
geom_histogram(aes(x = Freq), bins = 100) +
xlim(0,50) +
labs(x = "Word frequency", y = "Count", title = "NEWS")
grid.arrange(p1, p2, p3, ncol = 3)
In the figure above we see the distribution of word frequency in the different sources. The distribution is strongly right skewed, which indicates that most words are used very infrequently and only a few words are used frequently. The shape of the distribution looks the same for all sources.
c(mean(twitter_ng1_freq$Freq < 5), mean(blogs_ng1_freq$Freq < 5), mean(news_ng1_freq$Freq < 5))
## [1] 0.8384049 0.7837660 0.7567841
In the above calculation we see that more than 75% of all words have a frequency lower than 5.
#% coverage of words vs. unique words
p1 <- ggplot() +
geom_line(aes(x = 1:1000, y = cumsum(twitter_ng1_freq[1:1000,]$Freq)/sum(twitter_ng1_freq$Freq))) +
labs(x = "# unique words", y = "% of total words", title = "TWITTER") + ylim(0,0.8)
p2 <- ggplot() +
geom_line(aes(x = 1:1000, y = cumsum(blogs_ng1_freq[1:1000,]$Freq)/sum(blogs_ng1_freq$Freq))) +
labs(x = "# uniwue words", y = "% of total words", title = "BLOGS") + ylim(0,0.8)
p3 <- ggplot() +
geom_line(aes(x = 1:1000, y = cumsum(news_ng1_freq[1:1000,]$Freq)/sum(news_ng1_freq$Freq))) +
labs(x = "# unique words", y = "% of total words", title = "NEWS") + ylim(0,0.8)
grid.arrange(p1, p2, p3, ncol = 3)
If we look at the percentage of total word coverage compared to the number of unique words, we see that the top 250 words account for over 50% of the total corpus. We do see that for the news articles this fraction increases at a slower rate than for the blogs and Twitter messages. At 1,000 unique words the Twitter and blog messages are covered for approximately 75%, whereas the news articles are still below 70%.
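To make this coverage finding concrete, the sketch below computes how many of the top-ranked unique words are needed to cover a given share of all word tokens per source. The words_for_coverage helper is our own illustration (not part of the analysis above) and assumes the frequency tables are sorted in decreasing order, as create_unigram returns them.
# Sketch: number of most frequent words needed to cover a given share of all tokens.
# (Hypothetical helper; assumes freq_table is sorted by Freq in decreasing order.)
words_for_coverage <- function(freq_table, coverage = 0.5){
  cum_share <- cumsum(freq_table$Freq) / sum(freq_table$Freq)
  which(cum_share >= coverage)[1]}
sapply(list(twitter = twitter_ng1_freq, blogs = blogs_ng1_freq, news = news_ng1_freq),
       words_for_coverage, coverage = 0.9)
A figure like this gives a first indication of how small a vocabulary a prediction model could get away with.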
##% Overlap of most frequently used words
twitter_blogs_overlap <- c()
twitter_news_overlap <- c()
blogs_news_overlap <- c()
for(i in seq(10, 500, 10)){
twitter_blogs_overlap <- c(twitter_blogs_overlap, mean(twitter_ng1_freq[1:i, 1] %in% blogs_ng1_freq[1:i, 1]))
twitter_news_overlap <- c(twitter_news_overlap, mean(twitter_ng1_freq[1:i, 1] %in% news_ng1_freq[1:i, 1]))
blogs_news_overlap <- c(blogs_news_overlap, mean(blogs_ng1_freq[1:i, 1] %in% news_ng1_freq[1:i, 1]))}
p1 <-ggplot() +
geom_line(aes(x = seq(10, 500, 10), y = twitter_blogs_overlap)) +
labs(x = "The first # words", y = "% overlap", title = "TWITTER vs. BLOGS") + ylim(0.55,0.9)
p2 <-ggplot() +
geom_line(aes(x = seq(10, 500, 10), y = twitter_news_overlap)) +
labs(x = "The first # words", y = "% overlap", title = "TWITTER vs. NEWS") + ylim(0.55,0.9)
p3 <- ggplot() +
geom_line(aes(x = seq(10, 500, 10), y = blogs_news_overlap)) +
labs(x = "The first # words", y = "% overlap", title = "BLOGS vs. NEWS") + ylim(0.55,0.9)
grid.arrange(p1, p2, p3, ncol = 3)
c(mean(twitter_ng1_freq[1:150, 1] %in% blogs_ng1_freq[1:150, 1]),
mean(twitter_ng1_freq[1:150, 1] %in% news_ng1_freq[1:150, 1]) ,
mean(blogs_ng1_freq[1:150, 1] %in% news_ng1_freq[1:150, 1]))
## [1] 0.7666667 0.7066667 0.8266667
In the above image we see how much overlap there is between the different sources over the first # most frequently used words. We see that news and blogs are closest to each other in which words are used and how frequently they are used. Twitter and blogs also use many of the same words, but have less overlap than blogs and news articles have. Twitter and news documents clearly have the least amount of shared word use.
##Create bigrams
create_bigram <- function(x){
x <- x[sapply(x, function(y) {length((strsplit(y, split = " ")[[1]])) > 2})]
x <- ngram(x, n=2)
x <- get.phrasetable(x)
}
twitter_ng2_freq <- create_bigram(twitter_train)
blogs_ng2_freq <- create_bigram(blogs_train)
news_ng2_freq <- create_bigram(news_train)
head(twitter_ng2_freq)
## ngrams freq prop
## 1 in the 54583 0.003019667
## 2 for the 51794 0.002865372
## 3 of the 39809 0.002202332
## 4 on the 33669 0.001862652
## 5 to be 32744 0.001811479
## 6 to the 30266 0.001674390
#Distribution of bigram frequency
p1 <- ggplot(twitter_ng2_freq) +
geom_histogram(aes(x = freq), bins = 100) +
xlim(0,25) +
labs(x = "Bigram frequency", y = "Count", title = "TWITTER")
p2 <- ggplot(blogs_ng2_freq) +
geom_histogram(aes(x = freq), bins = 100) +
xlim(0,25) +
labs(x = "Bigram frequency", y = "Count", title = "BLOGS")
p3 <- ggplot(news_ng2_freq) +
geom_histogram(aes(x = freq), bins = 100) +
xlim(0,25) +
labs(x = "Bigram frequency", y = "Count", title = "NEWS")
grid.arrange(p1, p2, p3, ncol = 3)
Just like with the unigrams, we see that the distribution of bigram frequency is strongly right skewed. This means that most bigrams occur with very low frequency and that increasingly fewer bigrams occur with high frequency. We do however see that the frequency drops off faster for bigrams than for unigrams, meaning that the distribution is more skewed and that relatively more bigrams than unigrams occur with very low frequency.
c(mean(twitter_ng2_freq$freq < 5), mean(blogs_ng2_freq$freq < 5), mean(news_ng2_freq$freq < 5))
## [1] 0.9128107 0.9071131 0.9379017
In the above calculation we see that more than 90% of all the bigrams have a frequency that is lower than 5.
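This sparsity hints at the complexity reduction mentioned in the summary: dropping rare n-grams removes most rows of the table while keeping most of the observed bigram tokens. The sketch below is our own illustration of that idea; the prune_ngrams helper is hypothetical and assumes the get.phrasetable() output created above (columns ngrams, freq, prop).
# Sketch: drop rare bigrams and check how much of the table and of the tokens remain.
# (Hypothetical helper for illustration; not part of the analysis above.)
prune_ngrams <- function(freq_table, min_freq = 5){
  freq_table[freq_table$freq >= min_freq, ]}
twitter_ng2_pruned <- prune_ngrams(twitter_ng2_freq)
c(rows_kept = nrow(twitter_ng2_pruned) / nrow(twitter_ng2_freq),
  tokens_kept = sum(twitter_ng2_pruned$freq) / sum(twitter_ng2_freq$freq))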
# % Overlap of most frequently used bigrams
twitter_blogs_overlap <- c()
twitter_news_overlap <- c()
blogs_news_overlap <- c()
for(i in seq(10, 500, 10)){
twitter_blogs_overlap <- c(twitter_blogs_overlap, mean(twitter_ng2_freq[1:i, 1] %in% blogs_ng2_freq[1:i, 1]))
twitter_news_overlap <- c(twitter_news_overlap, mean(twitter_ng2_freq[1:i, 1] %in% news_ng2_freq[1:i, 1]))
blogs_news_overlap <- c(blogs_news_overlap, mean(blogs_ng2_freq[1:i, 1] %in% news_ng2_freq[1:i, 1]))}
p1 <-ggplot() +
geom_line(aes(x = seq(10, 500, 10), y = twitter_blogs_overlap)) +
labs(x = "The first # words", y = "% overlap", title = "TWITTER vs. BLOGS") + ylim(0.25,0.8)
p2 <-ggplot() +
geom_line(aes(x = seq(10, 500, 10), y = twitter_news_overlap)) +
labs(x = "The first # words", y = "% overlap", title = "TWITTER vs. NEWS") + ylim(0.25,0.8)
p3 <- ggplot() +
geom_line(aes(x = seq(10, 500, 10), y = blogs_news_overlap)) +
labs(x = "The first # words", y = "% overlap", title = "BLOGS vs. NEWS") + ylim(0.25,0.8)
grid.arrange(p1, p2, p3, ncol = 3)
For the overlap in bigram usage between the different sources we see that the news articles and blogs have the most overlap, meaning that they use the same bigrams with similar relative frequency. The overlap between news and Twitter is the smallest.
##Create trigrams
create_trigram <- function(x){
x <- x[sapply(x, function(y) {length((strsplit(y, split = " ")[[1]])) > 3})]
x <- ngram(x, n=3)
x <- get.phrasetable(x)
}
twitter_ng3_freq <- create_trigram(twitter_train)
blogs_ng3_freq <- create_trigram(blogs_train)
news_ng3_freq <- create_trigram(news_train)
#Distribution of trigram frequency
p1 <- ggplot(twitter_ng3_freq) +
geom_histogram(aes(x = freq), bins = 100) +
xlim(0,25) +
labs(x = "Trigram frequency", y = "Count", title = "TWITTER")
p2 <- ggplot(blogs_ng3_freq) +
geom_histogram(aes(x = freq), bins = 100) +
xlim(0,25) +
labs(x = "Trigram frequency", y = "Count", title = "BLOGS")
p3 <- ggplot(news_ng3_freq) +
geom_histogram(aes(x = freq), bins = 100) +
xlim(0,25) +
labs(x = "Trigram frequency", y = "Count", title = "NEWS")
grid.arrange(p1, p2, p3, ncol = 3)
The distribution of trigram frequency is similar to the unigram and bigram distributions (as expected). We do however see that the distribution falls off even more rapidly.
c(mean(twitter_ng3_freq$freq < 5), mean(blogs_ng3_freq$freq < 5), mean(news_ng3_freq$freq < 5))
## [1] 0.9647701 0.9647013 0.9864307
In the above calculation we see that more than 96% of all the trigrams have a frequency that is lower than 5.
# % Overlap of most frequently used trigrams
twitter_blogs_overlap <- c()
twitter_news_overlap <- c()
blogs_news_overlap <- c()
for(i in seq(10, 500, 10)){
twitter_blogs_overlap <- c(twitter_blogs_overlap, mean(twitter_ng3_freq[1:i, 1] %in% blogs_ng3_freq[1:i, 1]))
twitter_news_overlap <- c(twitter_news_overlap, mean(twitter_ng3_freq[1:i, 1] %in% news_ng3_freq[1:i, 1]))
blogs_news_overlap <- c(blogs_news_overlap, mean(blogs_ng3_freq[1:i, 1] %in% news_ng3_freq[1:i, 1]))}
p1 <-ggplot() +
geom_line(aes(x = seq(10, 500, 10), y = twitter_blogs_overlap)) +
labs(x = "The first # words", y = "% overlap", title = "TWITTER vs. BLOGS") + ylim(0.25,0.8)
p2 <-ggplot() +
geom_line(aes(x = seq(10, 500, 10), y = twitter_news_overlap)) +
labs(x = "The first # words", y = "% overlap", title = "TWITTER vs. NEWS") + ylim(0.25,0.8)
p3 <- ggplot() +
geom_line(aes(x = seq(10, 500, 10), y = blogs_news_overlap)) +
labs(x = "The first # words", y = "% overlap", title = "BLOGS vs. NEWS") + ylim(0.25,0.8)
grid.arrange(p1, p2, p3, ncol = 3)
As with the unigrams and bigrams, we see that blogs and news articles have the highest overlap in trigram use. The overlap is lowest between Twitter feeds and news articles.
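To make the direction of the intended model concrete, the sketch below shows the lookup idea from the summary: given the previous words, take the most frequent trigrams that start with them and return their final words, backing off to bigrams when no trigram matches. The predict_next helper is hypothetical; it assumes the get.phrasetable() tables created above, which are already sorted by frequency and may carry a trailing space in the ngrams column.
# Hypothetical sketch of next-word prediction from the n-gram frequency tables above.
predict_next <- function(prev_words, ng3, ng2, n = 3){
  grams3 <- as.character(ng3$ngrams)
  hits <- grams3[startsWith(grams3, paste0(prev_words, " "))]
  if(length(hits) == 0){
    # Back off to bigrams keyed on the last word only.
    last_word <- tail(strsplit(prev_words, " ")[[1]], 1)
    grams2 <- as.character(ng2$ngrams)
    hits <- grams2[startsWith(grams2, paste0(last_word, " "))]}
  # Return the final word of the n most frequent matching n-grams.
  sapply(strsplit(trimws(head(hits, n)), " "), tail, 1)}
#predict_next("in the", news_ng3_freq, news_ng2_freq)
Returning the top three candidates matches the three suggestions the final model should offer, and pruning rare n-grams as sketched earlier would keep both the lookup tables and the lookup itself small enough to be fast.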
Key findings:
Conclusions: