We will assume that the text data is contained in a subfolder “en_US” of the current working directory.
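As a small optional sanity check (assuming the three file names used later in this report), we can confirm that the expected files are present before reading anything:
file.exists(c('en_US/en_US.twitter.txt', 'en_US/en_US.news.txt', 'en_US/en_US.blogs.txt'))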
library(tm)
## Loading required package: NLP
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidytext)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
set.seed(25)
tweets <- readLines('en_US/en_US.twitter.txt', encoding = 'UTF-8')
## Warning in readLines("en_US/en_US.twitter.txt", encoding = "UTF-8"): line
## 167155 appears to contain an embedded nul
## Warning in readLines("en_US/en_US.twitter.txt", encoding = "UTF-8"): line
## 268547 appears to contain an embedded nul
## Warning in readLines("en_US/en_US.twitter.txt", encoding = "UTF-8"): line
## 1274086 appears to contain an embedded nul
## Warning in readLines("en_US/en_US.twitter.txt", encoding = "UTF-8"): line
## 1759032 appears to contain an embedded nul
news <- readLines('en_US/en_US.news.txt', encoding = 'UTF-8')
blogs <- readLines('en_US/en_US.blogs.txt', encoding = 'UTF-8')
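The embedded-nul warnings are harmless for our purposes; if we wanted to suppress them, readLines() also accepts a skipNul = TRUE argument that silently drops the offending bytes. The equivalent call for the tweets, as an optional alternative, would be:
tweets <- readLines('en_US/en_US.twitter.txt', encoding = 'UTF-8', skipNul = TRUE)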
To get some idea of the scale of our data, we at least need to know the number of lines in each of the text files.
length(tweets)
## [1] 2360148
length(news)
## [1] 1010242
length(blogs)
## [1] 899288
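Line counts only partially convey the size of the data; as an additional rough check (a small sketch using the same file paths as above), we can also inspect the raw file sizes in megabytes:
# raw file sizes in megabytes
round(file.size('en_US/en_US.twitter.txt') / 1024^2, 1)
round(file.size('en_US/en_US.news.txt') / 1024^2, 1)
round(file.size('en_US/en_US.blogs.txt') / 1024^2, 1)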
We certainly won’t be able to use the complete corpus of any of the data sets, as their size would significantly slow down our work. As such, we elect to randomly sample a subset from each of our corpora. It is with these samples that we will begin to count the number of words as well as the number of various n-grams. From these counts we will later generate sample statistics which we hope will sufficiently model the global set of data.
# sample 10,000 lines from each corpus, convert to lowercase, and strip everything except letters, digits, and spaces
tweet_lines <- gsub("[^[:alnum:] ]", "", tolower(sample(tweets, 10000, replace = FALSE)))
news_lines <- gsub("[^[:alnum:] ]", "", tolower(sample(news, 10000, replace = FALSE)))
blog_lines <- gsub("[^[:alnum:] ]", "", tolower(sample(blogs, 10000, replace = FALSE)))
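As a rough check on how much text each cleaned sample actually contains (a minimal base R sketch; the exact totals will vary with the random sample):
# approximate word counts in each 10,000-line sample
sum(lengths(strsplit(tweet_lines, "\\s+")))
sum(lengths(strsplit(news_lines, "\\s+")))
sum(lengths(strsplit(blog_lines, "\\s+")))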
Given a corpus of data, an n-gram is a sequence of n tokens which occur in succession within the data. In the case of text, a common example of a 2-gram is “of the”, and a common 3-gram is “one of the”. Since our ultimate goal is to predict which word a user is about to type given the words they have already typed, n-grams are a natural object of study. Suppose, for example, that we have seen someone type the phrase “one of” and we wish to predict the next word. We could search all of our 3-grams for those that start with “one of” and find the one with the highest frequency in our training set. We would then return the third word of that 3-gram as our prediction.
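As a concrete toy illustration (using the same unnest_tokens tokenizer that we apply to the full samples below), the 2-grams of the short sentence “one of the best” are “one of”, “of the”, and “the best”:
# toy example: extract the 2-grams of a single short sentence
unnest_tokens(data.frame(txt = "one of the best", stringsAsFactors = FALSE), twograms, txt, token = "ngrams", n = 2)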
Fortunately for us, the tidytext package, which provides unnest_tokens, makes generating such n-grams relatively easy. Below we generate the sets of 1-grams, 2-grams, and 3-grams for each of our three data sets.
tweet_lines_df <- data.frame(tweet_lines)
tweet_onegrams <- unnest_tokens(tweet_lines_df, onegrams, tweet_lines, token = "ngrams", n = 1)
tweet_twograms <- unnest_tokens(tweet_lines_df, twograms, tweet_lines, token = "ngrams", n = 2)
tweet_threegrams <- unnest_tokens(tweet_lines_df, threegrams, tweet_lines, token = "ngrams", n = 3)
news_lines_df <- data.frame(news_lines)
news_onegrams <- unnest_tokens(news_lines_df, onegrams, news_lines, token = "ngrams", n = 1)
news_twograms <- unnest_tokens(news_lines_df, twograms, news_lines, token = "ngrams", n = 2)
news_threegrams <- unnest_tokens(news_lines_df, threegrams, news_lines, token = "ngrams", n = 3)
blog_lines_df <- data.frame(blog_lines)
blog_onegrams <- unnest_tokens(blog_lines_df, onegrams, blog_lines, token = "ngrams", n = 1)
blog_twograms <- unnest_tokens(blog_lines_df, twograms, blog_lines, token = "ngrams", n = 2)
blog_threegrams <- unnest_tokens(blog_lines_df, threegrams, blog_lines, token = "ngrams", n = 3)
Once we have the collections of n-grams, it remains only to count the number of occurrences of each n-gram in each corpus. We make use of dplyr’s count function for this task.
tweet_onegram_counts <- count(tweet_onegrams, onegrams, sort = TRUE)
tweet_twogram_counts <- count(tweet_twograms, twograms, sort = TRUE)
tweet_threegram_counts <- count(tweet_threegrams, threegrams, sort = TRUE)
news_onegram_counts <- count(news_onegrams, onegrams, sort = TRUE)
news_twogram_counts <- count(news_twograms, twograms, sort = TRUE)
news_threegram_counts <- count(news_threegrams, threegrams, sort = TRUE)
blog_onegram_counts <- count(blog_onegrams, onegrams, sort = TRUE)
blog_twogram_counts <- count(blog_twograms, twograms, sort = TRUE)
blog_threegram_counts <- count(blog_threegrams, threegrams, sort = TRUE)
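As a quick illustration of the lookup idea described earlier (a rough sketch only, not the final prediction model), we can pull the most frequent news 3-grams that begin with “one of” and read off the predicted next word:
# candidate completions: news 3-grams starting with "one of", most frequent first
candidates <- filter(news_threegram_counts, grepl("^one of ", threegrams))
head(candidates, 5)
# the predicted next word is the final token of the top-ranked 3-gram
strsplit(as.character(candidates$threegrams[1]), " ")[[1]][3]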
Finally, we generate plots of our results so that we can build an understanding of some of the potential differences among our data sets. In particular, we expect the tweet data to be fundamentally different due to the character-length restriction on tweets.
t1g <- ggplot(tweet_onegram_counts[1:25,], aes(x=reorder(onegrams, n), y=n)) +
geom_bar(stat = "identity") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("onegram") + ylab("Frequency") +
labs(title = "25 Most Common Tweet Onegrams")
t2g <- ggplot(tweet_twogram_counts[1:25,], aes(x=reorder(twograms, n), y=n)) +
geom_bar(stat = "identity") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("twogram") + ylab("Frequency") +
labs(title = "25 Most Common Tweet Twograms")
t3g <- ggplot(tweet_threegram_counts[1:25,], aes(x=reorder(threegrams, n), y=n)) +
geom_bar(stat = "identity") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("threegram") + ylab("Frequency") +
labs(title = "25 Most Common Tweet Threegrams")
n1g <- ggplot(news_onegram_counts[1:25,], aes(x=reorder(onegrams, n), y=n)) +
geom_bar(stat = "identity") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("onegram") + ylab("Frequency") +
labs(title = "25 Most Common News Onegrams")
n2g <- ggplot(news_twogram_counts[1:25,], aes(x=reorder(twograms, n), y=n)) +
geom_bar(stat = "identity") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("twogram") + ylab("Frequency") +
labs(title = "25 Most Common News Twograms")
n3g <- ggplot(news_threegram_counts[1:25,], aes(x=reorder(threegrams, n), y=n)) +
geom_bar(stat = "identity") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("threegram") + ylab("Frequency") +
labs(title = "25 Most Common News Threegrams")
b1g <- ggplot(blog_onegram_counts[1:25,], aes(x=reorder(onegrams, n), y=n)) +
geom_bar(stat = "identity") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("onegram") + ylab("Frequency") +
labs(title = "25 Most Common Blog Onegrams")
b2g <- ggplot(blog_twogram_counts[1:25,], aes(x=reorder(twograms, n), y=n)) +
geom_bar(stat = "identity") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("twogram") + ylab("Frequency") +
labs(title = "25 Most Common Blog Twograms")
b3g <- ggplot(blog_threegram_counts[1:25,], aes(x=reorder(threegrams, n), y=n)) +
geom_bar(stat = "identity") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("threegram") + ylab("Frequency") +
labs(title = "25 Most Common Blog Threegrams")
Now that the plots have been built, we display them.
print(t1g)
print(n1g)
print(b1g)
print(t2g)
print(n2g)
print(b2g)
print(t3g)
print(n3g)
print(b3g)
As anticipated, the 3-grams for the tweet data set show some interesting differences which seem to relate to the more relaxed, informal, conversational nature of tweets. In the case of the tweets, the most common 3-gram in our sample was “thanks for the”, whereas the most common 3-gram in each of the other data sets was the more generic phrase “one of the”.
Now that we have collected our sets of n-grams, it should be relatively simple to generate probability tables for each of them. We will then use these tables, as well as possibly a few other techniques, to build our predictive models.
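For example, a maximum-likelihood probability table for the tweet 1-grams could be derived directly from the counts above (a minimal sketch; smoothing and back-off strategies would be layered on later):
# convert raw 1-gram counts to relative frequencies
tweet_onegram_probs <- mutate(tweet_onegram_counts, p = n / sum(n))
head(tweet_onegram_probs, 5)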