I begin by loading the packages that I will use for natural language processing and reading in the data with readLines.
library(tm)
library(NLP)
##
## Attaching package: 'NLP'
##
## The following objects are masked from 'package:tm':
##
## meta, meta<-
library(openNLP)
twitter <- readLines('en_US.twitter.txt')
blogs <- readLines('en_US.blogs.txt')
news <- readLines('en_US.news.txt')
First, I want to know the length of each dataset, in number of lines:
length(twitter)
## [1] 2360148
length(blogs)
## [1] 899288
length(news)
## [1] 1010242
Next, what's the average length of a line, in characters?
mean(nchar(twitter))
## [1] 68.68
mean(nchar(blogs))
## [1] 230
mean(nchar(news))
## [1] 201.2
So while the Twitter dataset has the most lines, it also has much shorter lines.
I can also plot the distribution of line lengths in each dataset.
We see a gradual decrease in the number of tweets at greater lengths, but with upticks at 120 and 140 (the maximum tweet length).
hist(nchar(twitter))
The news and blogs datasets are both heavily skewed right, indicating that there are some very long news and blog entries that are far above the normal range of lengths.
hist(nchar(news))
hist(nchar(blogs))
Zooming in, we get a better sense of their distributions.
hist(nchar(blogs)[nchar(blogs) < 500])
hist(nchar(news)[nchar(news) < 500])
Interestingly, although the blogs dataset had a higher mean line length than news, the blogs seem to be more clustered towards the low end of the distribution. This suggests that relatively few very long blog posts are bringing up the mean. The news dataset is also skewed right, but it is closer to normally distributed than the blogs dataset.
These datasets are very large, so analyzing all of them will be too time consuming.
Instead, I'll take a random 0.1% sample of each dataset to explore: about 2,400 tweets, 900 blog posts, and 1,000 news articles.
set.seed(1024)
twitter_sample <- twitter[sample(1:length(twitter), round(length(twitter)/1000), replace=F)]
news_sample <- news[sample(1:length(news), round(length(news)/1000), replace=F)]
blog_sample <- blogs[sample(1:length(blogs), round(length(blogs)/1000), replace=F)]
I've written nested for loops to extract words, 2-grams, 3-grams, and 4-grams from each of the text files. This runs very slowly, so when I am building my model I will need to figure out how to make this run more quickly.
First, I create blank vectors.
twitter_sentences <- c()
twitter_words <- c()
twitter_2grams <- c()
twitter_3grams <- c()
twitter_4grams <- c()
Next, using a Maxent sentence token annotator from openNLP, I split individual tweets into sentences. Then, in nested for-loops, I standardize each sentence by removing all punctuation and converting it to lowercase, split it into individual words, and append the resulting words, 2-grams, 3-grams, and 4-grams to the blank vectors.
sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = 'en')
for(tweet in twitter_sample){
  tweet <- as.String(tweet)
  sentence_boundaries <- annotate(tweet, sentence_token_annotator)
  tweet_sentences <- tweet[sentence_boundaries]
  for(sentence in tweet_sentences){
    # remove capitalization and punctuation
    sentence_without_punc <- gsub("[[:punct:]]", "", as.character(sentence))
    sentence_clean <- tolower(sentence_without_punc)
    twitter_sentences <- c(twitter_sentences, sentence_clean)
    sentence_words <- strsplit(sentence_clean, split=" ")[[1]]
    for(i in seq_along(sentence_words)){
      twitter_words <- c(twitter_words, sentence_words[i])
      if (i > 1){
        twitter_2grams <- c(twitter_2grams, paste(sentence_words[i-1], sentence_words[i]))
      }
      if (i > 2){
        twitter_3grams <- c(twitter_3grams, paste(sentence_words[i-2], sentence_words[i-1], sentence_words[i]))
      }
      if (i > 3){
        twitter_4grams <- c(twitter_4grams, paste(sentence_words[i-3], sentence_words[i-2], sentence_words[i-1], sentence_words[i]))
      }
    }
  }
}
twitter_wordfreq <- sort(table(twitter_words), decreasing = TRUE)
twitter_2gramfreq <- sort(table(twitter_2grams), decreasing = TRUE)
twitter_3gramfreq <- sort(table(twitter_3grams), decreasing = TRUE)
twitter_4gramfreq <- sort(table(twitter_4grams), decreasing = TRUE)
I can then repeat the process with blogs and news.
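Rather than copying and pasting the loop above for each corpus, the same logic could be wrapped in a small helper function and applied to each sample. The sketch below is one possible way to do that; extract_ngrams is a name I am introducing here, not a function from tm, NLP, or openNLP.
# Sketch of a reusable version of the loop above (extract_ngrams is a
# hypothetical helper, not part of any package used here).
extract_ngrams <- function(lines, annotator) {
  words <- c(); grams2 <- c(); grams3 <- c(); grams4 <- c()
  for (line in lines) {
    line <- as.String(line)
    boundaries <- annotate(line, annotator)
    for (sentence in line[boundaries]) {
      # standardize: strip punctuation, lowercase, split on spaces
      clean <- tolower(gsub("[[:punct:]]", "", as.character(sentence)))
      w <- strsplit(clean, split = " ")[[1]]
      words <- c(words, w)
      for (i in seq_along(w)) {
        if (i > 1) grams2 <- c(grams2, paste(w[i-1], w[i]))
        if (i > 2) grams3 <- c(grams3, paste(w[i-2], w[i-1], w[i]))
        if (i > 3) grams4 <- c(grams4, paste(w[i-3], w[i-2], w[i-1], w[i]))
      }
    }
  }
  list(words = words, grams2 = grams2, grams3 = grams3, grams4 = grams4)
}
blog_tokens <- extract_ngrams(blog_sample, sentence_token_annotator)
news_tokens <- extract_ngrams(news_sample, sentence_token_annotator)
blog_words <- blog_tokens$words
news_words <- news_tokens$words
The corresponding frequency tables (blog_wordfreq, blog_2gramfreq, and so on) are then built with table() and sort(), exactly as for the Twitter sample.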
Finally, I have a list of the most common words and n-grams in each body of text. I'll look at the 50 most common words and the 20 most common n-grams of each order to show how the text files differ.
Since I took a 0.1% sample of each corpus, I can multiply the resulting word counts by 1000 to get an estimate of each corpus's word count.
print(length(twitter_words)*1000)
## [1] 29553000
print(length(blog_words)*1000)
## [1] 30015000
print(length(news_words)*1000)
## [1] 3.4e+07
print(twitter_wordfreq[1:50])
## twitter_words
## the to i a you and in for of
## 900 759 680 609 505 442 421 394 387 374
## is it my on that me at be have your
## 345 306 299 274 215 187 182 180 178 170
## so are with im just this we like get not
## 168 167 162 155 148 140 140 135 126 121
## out but its was up rt all do what good
## 121 119 117 116 113 112 110 109 108 98
## if thanks when u about from love can will dont
## 94 94 93 88 85 85 85 83 82 81
print(blog_wordfreq[1:50])
## blog_words
## the to a and of in for that is on with he
## 1763 810 782 748 641 604 304 304 254 236 223 222
## said at was it as his but be from have i
## 214 211 211 200 173 162 158 153 148 134 133 128
## an has by its who or this are about they will were
## 114 106 105 103 102 98 98 95 94 90 89 86
## we not one when more out would you had she their been
## 82 80 80 78 77 70 69 69 66 65 65 64
## what up
## 63 61
print(news_wordfreq[1:50])
## news_words
## the to and a in of for that on is with he
## 1860 887 882 870 728 664 338 333 295 272 236 226
## said it was at from as his have be i but
## 225 215 212 192 191 175 173 166 165 156 147 140
## are its an by not this has you who they more will
## 136 132 131 127 123 123 113 110 107 98 96 96
## or when about her we had out new up she what than
## 95 87 86 86 82 80 77 75 75 73 73 71
## were would
## 71 71
As you can see, the three corpora have different word frequencies. The words “I” and “you” are much more common on Twitter, where people are mostly expressing themselves or interacting with others. We also see Twitter-specific words such as “rt” and informal non-words such as “u”. The blogs and news datasets have more standard word frequency distributions.
An interesting question is: how many words are needed to cover 50% of the words in each body of text? The following code can find the answer.
for (i in seq_along(twitter_wordfreq)){
  if(sum(twitter_wordfreq[1:i]) > length(twitter_words)/2.0){
    print(i)
    break
  }
}
## [1] 118
for (i in seq_along(blog_wordfreq)){
  if(sum(blog_wordfreq[1:i]) > length(blog_words)/2.0){
    print(i)
    break
  }
}
## [1] 194
for (i in seq_along(news_wordfreq)){
  if(sum(news_wordfreq[1:i]) > length(news_words)/2.0){
    print(i)
    break
  }
}
## [1] 208
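The same thresholds can also be found without an explicit loop by taking a cumulative sum over each sorted frequency table. This is a minimal equivalent sketch, assuming the word vectors and frequency tables built above.
# Equivalent vectorized calculation: the first rank at which the cumulative
# word count exceeds half of all word tokens in the sample.
which(cumsum(twitter_wordfreq) > length(twitter_words) / 2)[1]
which(cumsum(blog_wordfreq) > length(blog_words) / 2)[1]
which(cumsum(news_wordfreq) > length(news_words) / 2)[1]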
This analysis also shows that Twitter, despite its creative spellings, draws on a smaller working vocabulary: 50% of all word tokens on Twitter are covered by just the 118 most common words, while 194 and 208 words are needed to cover 50% of all blog and news words, respectively. Below are the most common 2-grams, 3-grams, and 4-grams in each sample.
## twitter_2grams
## in the rt of the for the on the thanks for
## 74 72 70 67 51 46
## to be thank you to the at the going to have a
## 46 42 40 34 34 34
## i just i have if you to see to get i am
## 33 32 31 31 29 28
## i love want to
## 28 28
## twitter_3grams
## thanks for the cant wait to for the follow
## 26 15 12
## i need to to see you looking forward to
## 12 11 10
## check it out im going to of the day
## 8 8 8
## one of the you so much going to be
## 8 8 7
## have a great thank you for would like to
## 7 7 7
## rt a lot of how do you
## 6 6 6
## i have to i just saw
## 6 6
## twitter_4grams
## thanks for the follow add boston add boston thank you for the
## 10 5 5
## boston add boston add hope to see you thank you so much
## 4 4 4
## cant wait to see dont even know what even know what to
## 3 3 3
## going to be a i am going to i dont even know
## 3 3 3
## i will be there if you want to just trying to get
## 3 3 3
## love you so much on the other side thank you thank you
## 3 3 3
## the end of the to see you there
## 3 3
## blog_2grams
## of the in the to the at the for the on the in a and the
## 161 157 87 66 59 55 47 41
## to be with the with a for a he was from the he said as the
## 40 36 35 34 34 31 30 29
## that the one of as a by the
## 29 28 27 27
## blog_3grams
## one of the according to the the end of a lot of
## 15 7 7 6
## part of the as part of dont want to in the third
## 6 5 5 5
## out of the president of the said he was said it would
## 5 5 5 5
## some of the the way the a little bit according to a
## 5 5 4 4
## at the same in the fourth is one of it comes to
## 4 4 4 4
## blog_4grams
## at the same time when it comes to as part of a
## 4 4 3
## by the end of in the united states the blazers are
## 3 3 2
## 60 percent of the a large number of a lot of things
## 2 2 2
## a member of the and not paying for as a member of
## 2 2 2
## at the beginning of at the end of at the start of
## 2 2 2
## avenue on a charge be approved by a come out and play
## 2 2 2
## for the los angeles for the rest of
## 2 2
## news_2grams
## in the of the to the on the for the in a and the
## 172 170 79 75 51 50 45
## to be at the from the with the more than for a will be
## 43 42 42 39 36 34 34
## as a that the with a is a he said by the
## 33 33 33 31 30 27
## news_3grams
## because of the of the season as well as i dont think
## 7 7 6 6
## im going to in new york more than a one of the
## 6 6 6 6
## out of the the united states based on the part of the
## 6 6 5 5
## percent of the this is a to be a a lot of
## 5 5 5 4
## according to the any of the around the world be able to
## 4 4 4 4
## news_4grams
## in the united states from around the world 10 am to 4
## 4 3 2
## a scene in the a spokeswoman for the about equality of result
## 2 2 2
## allowed two runs on am to 4 pm and in the end
## 2 2 2
## and natural gas in are more likely to as well as a
## 2 2 2
## at a news conference at the community hall at the end of
## 2 2 2
## be found at the be required to disclose by tonight at midnight
## 2 2 2
## can be found at can do only so
## 2 2
Analyzing the n-grams gives us more insight into the content of each of these corpora. The phrases in the Twitter corpus mostly involve people talking about themselves or using social phrases such as “thanks for the”, “hope to see you”, and “love you so much”. In blogs and news, we start to see common news phrases such as “president of the” and “in the united states”.
This is only a basic analysis. Before I create my predictive text model, I will probably want to add some additional preprocessing steps, such as extracting part-of-speech tags, word roots, and preceding punctuation as features.
Storing all of these features will also require me to construct a faster method of preprocessing all of the lines.
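One direction I plan to explore, sketched below under the assumption that simple whitespace tokenization is good enough for a first pass (it skips the sentence annotation step used above), is to clean all lines at once with vectorized string functions and build each line's n-grams with lapply instead of growing vectors inside nested loops.
# Sketch of a faster preprocessing pass: vectorized cleaning, then per-line
# n-gram construction; make_ngrams is a hypothetical helper introduced here.
clean_lines <- tolower(gsub("[[:punct:]]", "", twitter_sample))
word_lists <- strsplit(clean_lines, split = " ")
make_ngrams <- function(w, n) {
  if (length(w) < n) return(character(0))
  vapply(seq_len(length(w) - n + 1),
         function(i) paste(w[i:(i + n - 1)], collapse = " "),
         character(1))
}
twitter_words_fast <- unlist(word_lists)
twitter_2grams_fast <- unlist(lapply(word_lists, make_ngrams, n = 2))
twitter_3grams_fast <- unlist(lapply(word_lists, make_ngrams, n = 3))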
Once I have a variety of features for each word, such as the previous word, previous n-grams, preceding punctuation, parts of speech, and roots of previous words, I can use them to build a variety of machine learning models whose accuracy I will test using cross-validation. Tree-based models such as random forests come to mind immediately as potentially useful, but I will also explore other algorithms and research which ones are commonly used for this purpose.
There will be many instances where the user is typing something novel, i.e., where the previous word or previous n-grams cannot be found in the corpus. In these cases, root-word and part-of-speech features could be particularly useful, since the model needs to return a suggestion and should have as many features available to it as possible, even if the input does not match any existing pattern exactly. However, the model should still assign some degree of confidence to its predictions; if a person is typing nonsense, or typing in another language, the model should not return any suggestions.
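Even with just the frequency tables built in this report, a very rough version of that behaviour can be sketched: look up the user's last few words in the longer n-gram tables first, back off to shorter ones, and return nothing when no match is found. predict_next below is a hypothetical illustration, not the model I intend to build.
# Hypothetical back-off lookup over the sorted Twitter n-gram tables.
predict_next <- function(last_words) {
  for (n in c(3, 2, 1)) {
    if (length(last_words) < n) next
    prefix <- paste(tail(last_words, n), collapse = " ")
    freqs <- switch(as.character(n),
                    "3" = twitter_4gramfreq,
                    "2" = twitter_3gramfreq,
                    "1" = twitter_2gramfreq)
    # the tables are already sorted by frequency, so the first match is the best
    hits <- grep(paste0("^", prefix, " "), names(freqs), value = TRUE)
    if (length(hits) > 0) return(sub(paste0("^", prefix, " "), "", hits[1]))
  }
  NULL  # unseen or nonsensical input: no suggestion
}
predict_next(c("thanks", "for", "the"))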
Ideally the text prediction model should be able to learn from the user's distinctive language patterns, so a later, more advanced goal of the Shiny app could be to take the user's sentences as input and add them to the training set for the next time it generates the model.