This document is an exploratory analysis in preparation for developing a model that takes a piece of text as input and returns three candidates for the next word in the text. The goal of this analysis is to understand how text data can be modelled as a structured dataset and to find qualities of the text data that indicate how a model can be built to predict the next word.
From the exploratory analysis we find that we can predict the next word in a text by knowing what the previous words are. We can train a model on word-sequence frequencies, and when we need to find the next word we look at the previous word sequence and predict the most likely continuation. We also see that the amount of data in our training set is very large. We therefore need to think of a way to make the model less complex to gain speed.
Key findings:
Conclusions:
Natural Language Processing (NLP) has become a field that delivers value to businesses and markets all over the world. The benefits of NLP range from sentiment analysis and topic modeling to information extraction. Another application of NLP is predictive text editing. This paper revolves around the exploratory analysis of different text data with the objective of finding a direction for a model that predicts the following word while someone is typing. In this exploratory analysis we are interested in the following questions:
We have a database of text documents coming from Twitter feeds, news articles and blogs. We have these text documents in 4 languages: English, Russian, German and Finnish. Since the model we want to create is an English text editor, we will only analyse the English text documents.
# Make connection to textdocument files
connection_twitter <- file("final/en_US/en_US.twitter.txt")
connection_news <- file("final/en_US/en_US.news.txt")
connection_blogs <- file("final/en_US/en_US.blogs.txt")
#Read lines in text documents
linestwitter <- readLines(con = connection_twitter, skipNul = TRUE)
linesnews <- readLines(con = connection_news, skipNul = TRUE)
linesblogs <- readLines(con = connection_blogs, skipNul = TRUE)
#Close file connections
close(connection_twitter)
close(connection_news)
close(connection_blogs)
We will split the data into a training and a test dataset (70% goes to the training set and 30% to the test set). We will only look at the training dataset in the exploratory phase to avoid possible biases and to be able to evaluate a model correctly.
##Create train and test dataset
get_train <- function(x){
  # Draw a Bernoulli(0.7) indicator per line and keep the selected lines as training data.
  # The fixed seed makes get_test() reproduce the same draws, so the two sets are exact complements.
  set.seed(333)
  train <- x[as.logical(rbinom(length(x), 1, 0.7))]
  return(train)}
get_test <- function(x){
  set.seed(333)
  test <- x[!as.logical(rbinom(length(x), 1, 0.7))]
  return(test)}
# Get training set for the 3 types of text documents and store test sets for future use.
connection_twitter <- file("final/twitter_test.txt")
connection_blogs <- file("final/blogs_test.txt")
connection_news <- file("final/news_test.txt")
twitter_train <- get_train(linestwitter)
#writeLines(get_test(linestwitter), connection_twitter)
news_train <- get_train(linesnews)
#writeLines(get_test(linesnews), connection_news)
blogs_train <- get_train(linesblogs)
#writeLines(get_test(linesblogs), connection_blogs)
close(connection_twitter)
close(connection_blogs)
close(connection_news)
We take a first look at the data to see what type of cleaning we would like to perform.
Twitter feeds
head(twitter_train, 5)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [4] "Words from a complete stranger! Made my birthday even better :)"
## [5] "i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up...damn thing"
Findings: for Twitter feeds we see that the corpus contains a lot of words that are not spelled correctly and/or casual speech. This is as expected, as Twitter is used in a casual fashion to send out short messages.
Blogs
head(blogs_train, 5)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan â\200œgodsâ\200\235."
## [2] "We love you Mr. Brown."
## [3] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."
## [4] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"
## [5] "If I were a bear,"
Findings: for blogs we see more correct language and more complete sentences that are grammatically sound.
News
head(news_train, 5)
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [3] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [4] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"
## [5] "14915 Charlevoix, Detroit"
Findings: for news we also see correct language and more complete sentences that are grammatically sound.
We want to clean the data so that we can start the exploratory analysis. Considering we want to create a predictive text editor, we do not want to remove any words from the corpus. We do however want to break down the structure of the text so that we can analyse it properly. We will therefore clean the data under the following assumptions:
Based on these assumptions and the first inspection of the data, we will perform the following cleaning steps:
prep <- function(x){
  # Split each document into sentences, lowercasing and stripping punctuation...
  x <- unlist(tokenize_sentences(x,
                                 lowercase = TRUE,
                                 strip_punct = TRUE,
                                 simplify = TRUE))
  # ...then collapse repeated whitespace.
  x <- str_squish(x)
  return(x)}
twitter_train <- prep(twitter_train)
blogs_train <- prep(blogs_train)
news_train <- prep(news_train)
head(news_train)
## [1] "he wasnt home alone apparently"
## [2] "the st"
## [3] "louis plant had to close"
## [4] "it would die of old age"
## [5] "workers had been making cars there since the onset of mass automotive production in the 1920s"
## [6] "the alaimo group of mount holly was up for a contract last fall to evaluate and suggest improvements to trenton water works"
Analysis plan: to see how we can predict the following word in a sentence, we first want to know what the data looks like. We will therefore analyse the frequency of unigrams (individual words), bigrams (sequences of 2 words) and trigrams (sequences of 3 words) across the different sources we have (Twitter, blogs and news). We also want to see whether there are big differences in the frequency of unigrams, bigrams and trigrams between our sources.
We first want to create the unigrams and explore the word frequency in the three sources.
#Create unigram
create_unigram <- function(x){
  # Tokenize into words, count the occurrences of each word and
  # return the frequency table sorted in decreasing order.
  x <- tokenize_words(x)
  x <- table(unlist(x))
  x <- data.frame(x)
  x <- x[order(x$Freq, decreasing = TRUE),]
  return(x)}
twitter_ng1_freq <- create_unigram(twitter_train)
blogs_ng1_freq <- create_unigram(blogs_train)
news_ng1_freq <- create_unigram(news_train)
c(nrow(twitter_ng1_freq), nrow(blogs_ng1_freq), nrow(news_ng1_freq))
## [1] 394554 369040 83946
We see that the Twitter and blog text documents contain roughly the same number of unique words, approximately 370,000 to 400,000. The news documents have far fewer unique words, approximately 84,000.
#Visualization of word frequency
p1 <- ggplot(twitter_ng1_freq) +
geom_histogram(aes(x = Freq), bins = 100) +
xlim(0,50) +
labs(x = "Word frequency", y = "Count", title = "TWITTER")
p2 <- ggplot(blogs_ng1_freq) +
geom_histogram(aes(x = Freq), bins = 100) +
xlim(0,50) +
labs(x = "Word frequency", y = "Count", title = "BLOGS")
p3 <- ggplot(news_ng1_freq) +
geom_histogram(aes(x = Freq), bins = 100) +
xlim(0,50) +
labs(x = "Word frequency", y = "Count", title = "NEWS")
grid.arrange(p1, p2, p3, ncol = 3)
In the figure above we see the distribution of word frequency in the different sources. The distribution is strongly right skewed, which indicates that most words are used very infrequently and only a few words are used frequently. The shape of the distribution looks the same for all sources.
c(mean(twitter_ng1_freq$Freq < 5), mean(blogs_ng1_freq$Freq < 5), mean(news_ng1_freq$Freq < 5))
## [1] 0.8384049 0.7837660 0.7567841
In the above calculation we see that more than 75% of all words have a frequency lower than 5.
#% coverage of words vs. unique words
p1 <- ggplot() +
geom_line(aes(x = 1:1000, y = cumsum(twitter_ng1_freq[1:1000,]$Freq)/sum(twitter_ng1_freq$Freq))) +
labs(x = "# unique words", y = "% of total words", title = "TWITTER") + ylim(0,0.8)
p2 <- ggplot() +
geom_line(aes(x = 1:1000, y = cumsum(blogs_ng1_freq[1:1000,]$Freq)/sum(blogs_ng1_freq$Freq))) +
labs(x = "# uniwue words", y = "% of total words", title = "BLOGS") + ylim(0,0.8)
p3 <- ggplot() +
geom_line(aes(x = 1:1000, y = cumsum(news_ng1_freq[1:1000,]$Freq)/sum(news_ng1_freq$Freq))) +
labs(x = "# unique words", y = "% of total words", title = "NEWS") + ylim(0,0.8)
grid.arrange(p1, p2, p3, ncol = 3)
If we look at the percentage of total word coverage compared to the number of unique words, we see that the top 250 words account for over 50% of the total corpus. We do see that for the news articles this fraction increases at a slower rate than for the blogs and Twitter messages. At 1,000 unique words the Twitter and blog messages are covered for approximately 75%, whereas the news articles are still below 70%.
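To make this coverage finding concrete, the sketch below computes how many of the top-ranked unique words are needed to cover a given share of all word tokens per source. The words_for_coverage helper is our own illustration (not part of the analysis above) and assumes the frequency tables are sorted in decreasing order, as create_unigram returns them.
# Sketch: number of most frequent words needed to cover a given share of all tokens.
# (Hypothetical helper; assumes freq_table is sorted by Freq in decreasing order.)
words_for_coverage <- function(freq_table, coverage = 0.5){
  cum_share <- cumsum(freq_table$Freq) / sum(freq_table$Freq)
  which(cum_share >= coverage)[1]}
sapply(list(twitter = twitter_ng1_freq, blogs = blogs_ng1_freq, news = news_ng1_freq),
       words_for_coverage, coverage = 0.9)
A figure like this gives a first indication of how small a vocabulary a prediction model could get away with.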
##% Overlap of most frequently used words
twitter_blogs_overlap <- c()
twitter_news_overlap <- c()
blogs_news_overlap <- c()
for(i in seq(10, 500, 10)){
twitter_blogs_overlap <- c(twitter_blogs_overlap, mean(twitter_ng1_freq[1:i, 1] %in% blogs_ng1_freq[1:i, 1]))
twitter_news_overlap <- c(twitter_news_overlap, mean(twitter_ng1_freq[1:i, 1] %in% news_ng1_freq[1:i, 1]))
blogs_news_overlap <- c(blogs_news_overlap, mean(blogs_ng1_freq[1:i, 1] %in% news_ng1_freq[1:i, 1]))}
p1 <-ggplot() +
geom_line(aes(x = seq(10, 500, 10), y = twitter_blogs_overlap)) +
labs(x = "The first # words", y = "% overlap", title = "TWITTER vs. BLOGS") + ylim(0.55,0.9)
p2 <-ggplot() +
geom_line(aes(x = seq(10, 500, 10), y = twitter_news_overlap)) +
labs(x = "The first # words", y = "% overlap", title = "TWITTER vs. NEWS") + ylim(0.55,0.9)
p3 <- ggplot() +
geom_line(aes(x = seq(10, 500, 10), y = blogs_news_overlap)) +
labs(x = "The first # words", y = "% overlap", title = "BLOGS vs. NEWS") + ylim(0.55,0.9)
grid.arrange(p1, p2, p3, ncol = 3)
c(mean(twitter_ng1_freq[1:150, 1] %in% blogs_ng1_freq[1:150, 1]),
mean(twitter_ng1_freq[1:150, 1] %in% news_ng1_freq[1:150, 1]) ,
mean(blogs_ng1_freq[1:150, 1] %in% news_ng1_freq[1:150, 1]))
## [1] 0.7666667 0.7066667 0.8266667
In the above image we see how much overlap there is between the different sources over the first # most frequently used words. We see that news and blogs are closest to each other in which words are used and how frequently they are used. Twitter and blogs also use many of the same words, but have less overlap than blogs and news articles have. Twitter and news documents clearly have the least amount of shared word use.
##Create bigrams
create_bigram <- function(x){
x <- x[sapply(x, function(y) {length((strsplit(y, split = " ")[[1]])) > 2})]
x <- ngram(x, n=2)
x <- get.phrasetable(x)
}
twitter_ng2_freq <- create_bigram(twitter_train)
blogs_ng2_freq <- create_bigram(blogs_train)
news_ng2_freq <- create_bigram(news_train)
head(twitter_ng2_freq)
## ngrams freq prop
## 1 in the 54583 0.003019667
## 2 for the 51794 0.002865372
## 3 of the 39809 0.002202332
## 4 on the 33669 0.001862652
## 5 to be 32744 0.001811479
## 6 to the 30266 0.001674390
#Distribution of bigram frequency
p1 <- ggplot(twitter_ng2_freq) +
geom_histogram(aes(x = freq), bins = 100) +
xlim(0,25) +
labs(x = "Bigram frequency", y = "Count", title = "TWITTER")
p2 <- ggplot(blogs_ng2_freq) +
geom_histogram(aes(x = freq), bins = 100) +
xlim(0,25) +
labs(x = "Bigram frequency", y = "Count", title = "BLOGS")
p3 <- ggplot(news_ng2_freq) +
geom_histogram(aes(x = freq), bins = 100) +
xlim(0,25) +
labs(x = "Bigram frequency", y = "Count", title = "NEWS")
grid.arrange(p1, p2, p3, ncol = 3)
Just like with the unigrams, we see that the distribution of bigram frequency is strongly right skewed. This means that most bigrams occur with very low frequency and that increasingly fewer bigrams occur with high frequency. We do however see that the frequency drops off faster for bigrams than for unigrams, meaning that the distribution is more skewed and that relatively more bigrams than unigrams occur with very low frequency.
c(mean(twitter_ng2_freq$freq < 5), mean(blogs_ng2_freq$freq < 5), mean(news_ng2_freq$freq < 5))
## [1] 0.9128107 0.9071131 0.9379017
In the above calculation we see that more than 90% of all the bigrams have a frequency that is lower than 5.
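This sparsity hints at the complexity reduction mentioned in the summary: dropping rare n-grams removes most rows of the table while keeping most of the observed bigram tokens. The sketch below is our own illustration of that idea; the prune_ngrams helper is hypothetical and assumes the get.phrasetable() output created above (columns ngrams, freq, prop).
# Sketch: drop rare bigrams and check how much of the table and of the tokens remain.
# (Hypothetical helper for illustration; not part of the analysis above.)
prune_ngrams <- function(freq_table, min_freq = 5){
  freq_table[freq_table$freq >= min_freq, ]}
twitter_ng2_pruned <- prune_ngrams(twitter_ng2_freq)
c(rows_kept = nrow(twitter_ng2_pruned) / nrow(twitter_ng2_freq),
  tokens_kept = sum(twitter_ng2_pruned$freq) / sum(twitter_ng2_freq$freq))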
# % Overlap of most frequently used bigrams
twitter_blogs_overlap <- c()
twitter_news_overlap <- c()
blogs_news_overlap <- c()
for(i in seq(10, 500, 10)){
twitter_blogs_overlap <- c(twitter_blogs_overlap, mean(twitter_ng2_freq[1:i, 1] %in% blogs_ng2_freq[1:i, 1]))
twitter_news_overlap <- c(twitter_news_overlap, mean(twitter_ng2_freq[1:i, 1] %in% news_ng2_freq[1:i, 1]))
blogs_news_overlap <- c(blogs_news_overlap, mean(blogs_ng2_freq[1:i, 1] %in% news_ng2_freq[1:i, 1]))}
p1 <-ggplot() +
geom_line(aes(x = seq(10, 500, 10), y = twitter_blogs_overlap)) +
labs(x = "The first # words", y = "% overlap", title = "TWITTER vs. BLOGS") + ylim(0.25,0.8)
p2 <-ggplot() +
geom_line(aes(x = seq(10, 500, 10), y = twitter_news_overlap)) +
labs(x = "The first # words", y = "% overlap", title = "TWITTER vs. NEWS") + ylim(0.25,0.8)
p3 <- ggplot() +
geom_line(aes(x = seq(10, 500, 10), y = blogs_news_overlap)) +
labs(x = "The first # words", y = "% overlap", title = "BLOGS vs. NEWS") + ylim(0.25,0.8)
grid.arrange(p1, p2, p3, ncol = 3)
For the overlap in bigram usage between the different sources we see that the news articles and blogs have the most overlap, meaning that they use the same bigrams with similar relative frequency. The overlap between news and Twitter is the smallest.
##Create trigrams
create_trigram <- function(x){
x <- x[sapply(x, function(y) {length((strsplit(y, split = " ")[[1]])) > 3})]
x <- ngram(x, n=3)
x <- get.phrasetable(x)
}
twitter_ng3_freq <- create_trigram(twitter_train)
blogs_ng3_freq <- create_trigram(blogs_train)
news_ng3_freq <- create_trigram(news_train)
#Distribution of trigram frequency
p1 <- ggplot(twitter_ng3_freq) +
geom_histogram(aes(x = freq), bins = 100) +
xlim(0,25) +
labs(x = "Trigram frequency", y = "Count", title = "TWITTER")
p2 <- ggplot(blogs_ng3_freq) +
geom_histogram(aes(x = freq), bins = 100) +
xlim(0,25) +
labs(x = "Trigram frequency", y = "Count", title = "BLOGS")
p3 <- ggplot(news_ng3_freq) +
geom_histogram(aes(x = freq), bins = 100) +
xlim(0,25) +
labs(x = "Trigram frequency", y = "Count", title = "NEWS")
grid.arrange(p1, p2, p3, ncol = 3)
The distribution of trigram frequency is similar to the unigram and bigram distributions (as expected). We do however see that the distribution falls off even more rapidly.
c(mean(twitter_ng3_freq$freq < 5), mean(blogs_ng3_freq$freq < 5), mean(news_ng3_freq$freq < 5))
## [1] 0.9647701 0.9647013 0.9864307
In the above calculation we see that more than 96% of all the trigrams have a frequency that is lower than 5.
# % Overlap of most frequently used trigrams
twitter_blogs_overlap <- c()
twitter_news_overlap <- c()
blogs_news_overlap <- c()
for(i in seq(10, 500, 10)){
twitter_blogs_overlap <- c(twitter_blogs_overlap, mean(twitter_ng3_freq[1:i, 1] %in% blogs_ng3_freq[1:i, 1]))
twitter_news_overlap <- c(twitter_news_overlap, mean(twitter_ng3_freq[1:i, 1] %in% news_ng3_freq[1:i, 1]))
blogs_news_overlap <- c(blogs_news_overlap, mean(blogs_ng3_freq[1:i, 1] %in% news_ng3_freq[1:i, 1]))}
p1 <-ggplot() +
geom_line(aes(x = seq(10, 500, 10), y = twitter_blogs_overlap)) +
labs(x = "The first # words", y = "% overlap", title = "TWITTER vs. BLOGS") + ylim(0.25,0.8)
p2 <-ggplot() +
geom_line(aes(x = seq(10, 500, 10), y = twitter_news_overlap)) +
labs(x = "The first # words", y = "% overlap", title = "TWITTER vs. NEWS") + ylim(0.25,0.8)
p3 <- ggplot() +
geom_line(aes(x = seq(10, 500, 10), y = blogs_news_overlap)) +
labs(x = "The first # words", y = "% overlap", title = "BLOGS vs. NEWS") + ylim(0.25,0.8)
grid.arrange(p1, p2, p3, ncol = 3)
As with the unigrams and bigrams, we see that blogs and news articles have the highest overlap in trigram use. The overlap is lowest between Twitter feeds and news articles.
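To make the direction of the intended model concrete, the sketch below shows the lookup idea from the summary: given the previous words, take the most frequent trigrams that start with them and return their final words, backing off to bigrams when no trigram matches. The predict_next helper is hypothetical; it assumes the get.phrasetable() tables created above, which are already sorted by frequency and may carry a trailing space in the ngrams column.
# Hypothetical sketch of next-word prediction from the n-gram frequency tables above.
predict_next <- function(prev_words, ng3, ng2, n = 3){
  grams3 <- as.character(ng3$ngrams)
  hits <- grams3[startsWith(grams3, paste0(prev_words, " "))]
  if(length(hits) == 0){
    # Back off to bigrams keyed on the last word only.
    last_word <- tail(strsplit(prev_words, " ")[[1]], 1)
    grams2 <- as.character(ng2$ngrams)
    hits <- grams2[startsWith(grams2, paste0(last_word, " "))]}
  # Return the final word of the n most frequent matching n-grams.
  sapply(strsplit(trimws(head(hits, n)), " "), tail, 1)}
#predict_next("in the", news_ng3_freq, news_ng2_freq)
Returning the top three candidates matches the three suggestions the final model should offer, and pruning rare n-grams as sketched earlier would keep both the lookup tables and the lookup itself small enough to be fast.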
Key findings:
Conclusions: