In this section we perform a thorough exploratory analysis of the data to understand the distribution of words and the relationships between words in the corpora.
We can download the data from:
or we can download it directly in R in the following way.
## Once you get the download link, you can download the file by the following code.
download.file("url",destfile = "raw.zip")
unzip("raw.zip")
Either way, we download the raw zip file and then unzip it. We get several files, but we focus on the English files “en_US.twitter.txt”, “en_US.blogs.txt” and “en_US.news.txt”.
Three functions we will use to deal with the data:
getWords
: takes a list of sentences as input, returns the list of sentences with only word characters kept.
subList
: takes an index vector and a list, returns the sublist indexed by the vector.
subsetData
: takes a list and an integer m, returns a random sublist whose size is the length of the input divided by m.
The first thing I want to do is to keep only the characters “a-z” and “A-Z”. The following is a function to extract words from running text.
## This function receives a list of character strings as input and returns a list
## of strings containing only a-z, A-Z and the space character, so that words are kept.
getWords <- function(corpus){
  n <- length(corpus) ## corpus is a list of sentences
  x <- corpus
  for (i in 1:n) {
    ## note: the class must be a-zA-Z (capital Z); "a-zA-z" would also keep the
    ## punctuation characters that sit between Z and a in the ASCII table
    x[[i]] <- gsub("[^a-zA-Z ]", "", corpus[[i]])
  }
  x
}
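As a quick illustration on a made-up two-sentence input (my own example, not from the corpus), only letters and spaces survive:
## Hypothetical example input, only to show the behavior of getWords:
## punctuation and digits are removed, letters and spaces are kept.
getWords(list("Hello, world! 123", "It's 5 o'clock."))
## the first element becomes "Hello world " (the spaces are kept)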
We read the data set and clean it. Dealing with the whole dataset is too time consuming, so we work with a smaller sample of each data set.
con <- file("en_US.twitter.txt","r")
twitter <- readLines(con)
close(con)
## subset a list orilist with an index vector x
subList <- function(x, orilist){
  y <- list()
  n <- length(x)
  for (i in 1:n){
    y[[i]] <- orilist[[x[i]]]
  }
  y
}
## take a random sample of roughly 1/m of the elements of data
subsetData <- function(data, m = 100){
  n <- length(data)
  x <- sample(1:n, n/m)
  subList(x, data)
}
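A tiny example on hypothetical input (names and values are my own, just to show the behavior):
## subList picks elements by index; subsetData draws a random 1/m fraction
toy <- as.list(letters[1:10])
subList(c(2, 5), toy)       ## list("b", "e")
length(subsetData(toy, 5))  ## 2, i.e. 10/5 elements chosen at random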
set.seed(1)
twitterSamples <- getWords(subsetData(twitter, 1000))
The dataset twitterSamples is then a sample of 2360 lines, with elementary cleaning, drawn from the twitter training data. To save time, we sometimes work with this 2360-line sample twitterSamples instead of the whole dataset twitter.
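A quick sanity check of the sample size (2360148 lines / 1000 ≈ 2360):
length(twitterSamples)  ## 2360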
Functions we are going to use:
countSpace
: takes a sentence, returns the number of tokens in the sentence.
tokenCount
: counts the tokens in a list of sentences.
We give a table of the size and number of lines of each English txt file.
con <- file("en_US.blogs.txt","r")
blogs <- readLines(con)
close(con)
con <- file("en_US.news.txt","r")
news <- readLines(con)
close(con)
## number of tokens in a sentence = number of spaces + 1
countSpace <- function(x){
  nchar(gsub("[^ ]", "", x)) + 1
}
tokenCount <- function(corpus){
  n = length(corpus)
  m = 0
  for (i in 1:n) {
    m = m + countSpace(corpus[[i]])
  }
  m
}
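Quick checks on hypothetical inputs:
countSpace("this is a test")          ## 4 (three spaces + 1)
tokenCount(list("one two", "three"))  ## 3 tokens in total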
set.seed(1)
## tokens of the twitters
tokenCount(subsetData(twitter))
## [1] 302634
## tokens of the news
tokenCount(subsetData(news))
## [1] 27614
## tokens of the blogs
tokenCount(subsetData(blogs))
## [1] 368602
The number of tokens in each full file is approximately 100 times the number above, since subsetData keeps about 1/100 of the lines by default. The following table shows the number of lines and the size of each dataset.
Statistic items | en_US.twitter.txt | en_US.blogs.txt | en_US.news.txt |
---|---|---|---|
Num of Lines | 2360148 | 899288 | 77259 |
Size of file (Mb) | 301.4 | 248.5 | 19.2 |
From the table, we can see that the twitter file has by far the most lines: more than twice as many as blogs and over thirty times as many as news. The news file has both the fewest lines and the smallest size.
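These line counts and sizes can be computed as follows; this is a sketch of mine, assuming the reported sizes are the in-memory sizes of the character vectors, which format(object.size(x), units = "Mb") reports:
## number of lines and in-memory size (Mb) of each dataset
sapply(list(twitter = twitter, blogs = blogs, news = news), length)
sapply(list(twitter = twitter, blogs = blogs, news = news),
       function(x) format(object.size(x), units = "Mb"))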
A function we are going to use:
tokenize
: takes a list of sentences, returns a list of token vectors.
The first step in calculating the frequencies is to split the sentences of the corpus into tokens. The following is our function to do the work.
tokenize <- function(corpus){
  sapply(corpus, function(x){strsplit(x, " ")})
}
Code to apply the tokenize function:
## split the sampled twitter data and news data
wordstwitter <- tokenize(twitterSamples)
wordsnews <- tokenize(getWords(news))
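On a tiny hypothetical input, the shape of the return value is easier to see:
tokenize(list("a quick example"))  ## a list containing c("a", "quick", "example")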
The tokenized news data wordsnews has a size of 154.9 Mb, which is much bigger than the original 19.2 Mb. This means we cannot tokenize the whole twitter dataset for the statistical summaries, because it would take too much time and memory to compute.
Functions defined and to be used:
freqTable
: takes a list of tokens, returns the frequency table of the tokens.
cumprobTable
: shows the cumulative probability of the n most frequent tokens of a list.
twittertokens <- tokenize(twitterSamples)
freqTable <- function(inputList){
  n <- length(inputList)
  x <- rep(0, n)        ## counts, grown as new tokens are found
  xname <- names(x)     ## names of the tokens seen so far (starts empty)
  k <- 0
  for (i in 1:n){
    v <- inputList[[i]]
    for (j in 1:length(v)){
      if (v[j] %in% xname) {
        x[v[j]] = x[v[j]] + 1   ## known token: increment its count
      }
      else {
        k = k + 1               ## new token: append it with count 1
        xname[k] = v[j]
        x[k] = 1
      }
      names(x) <- xname
    }
  }
  sort(x, decreasing = T)
}
freqTable(twittertokens)[1:30]
## the to a I you and is for of in it my on that be
## 846 761 559 552 474 411 355 353 347 340 262 256 238 228 204
## are with at me your have this Im like out just so all i not
## 181 170 168 150 147 139 134 133 122 118 109 108 107 101 100
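As a side note, for a sample that fits comfortably in memory the same table can be obtained with base R; a minimal sketch of an equivalent variant of mine, not the code used above:
## equivalent frequency table: unlist() flattens the list of token vectors,
## table() counts them, sort() orders them by decreasing frequency
freqTableFast <- function(inputList){
  sort(table(unlist(inputList)), decreasing = TRUE)
}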
cumprobTable <- function(n = 20, inputlist){
  v <- freqTable(inputlist)
  cumsum(v[1:n])/sum(v)
}
plot(cumprobTable(n= 500, twittertokens), type = "l", xlab = "Number of tokens", ylab = "Probability")
abline(h = 0.5, col = "red")
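The red line marks 50% coverage; to read off how many of the most frequent tokens are needed to cross it, a small helper of my own can be used:
## index of the first token at which the cumulative probability reaches 0.5
## (NA if 50% is not reached within the first 500 tokens)
cp <- cumprobTable(n = 500, twittertokens)
which(cp >= 0.5)[1]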
The main difficulty in dealing with these corpora is that processing them is time consuming, and our way to save time is to sample the whole dataset. In this section, we focus on calculating the frequencies of different situations to build the n-gram model.
Functions we are going to use:
freqToken
: takes a specified token and a corpus, returns the relative frequency of this token in a random sample of the corpus.
freqToken <- function(token, corpus, m = 1000){
  li <- tolower(subsetData(corpus, m))
  tokenList <- tokenize(getWords(li))
  countToken <- function(token, x){
    count <- 0
    for (i in 1:length(x)){
      if (identical(token, x[i])){
        count = count + 1
      }
    }
    count
  }
  sum(sapply(tokenList, function(x){countToken(token, x)}))/tokenCount(li)
}
highFreqTokenTable
: takes a corpus, a divisor m and a number of tokens to show, returns the relative frequencies of the tokens with the highest probability.
highFreqTokenTable <- function(corpus, m = 1000, n = 20){
  li <- tolower(subsetData(corpus, m))
  tokenList <- tokenize(getWords(li))
  freqTable(tokenList)[1:n]/sum(freqTable(tokenList))
}
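For a single word, freqToken gives the same kind of relative frequency; a usage sketch (the exact value depends on the random sample that is drawn):
set.seed(1)
freqToken("the", twitter, m = 5000)
## expect a value around 0.03, consistent with the entry for "the" in the table below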
We have done some basic summaries in Section 1; here we go a bit further.
highFreqTokenTable(twitter, m = 5000, n = 20)
## the to i you a and
## 0.029381265 0.026097477 0.025406153 0.020394055 0.018492914 0.015727618
## for in on of is it
## 0.012443830 0.012098168 0.011752506 0.011234013 0.011061182 0.010888351
## my that me be your was
## 0.010715520 0.008468718 0.007604563 0.007258901 0.006913239 0.006394746
## just so
## 0.006394746 0.006221915
highFreqTokenTable(blogs, m = 2000, n = 20)
## the to and a of i
## 0.048134062 0.029771809 0.029236986 0.026681721 0.023116235 0.020204421
## in is it that for you
## 0.014380794 0.012300927 0.011647255 0.011468980 0.009805087 0.009567388
## was with this as on have
## 0.007130972 0.007130972 0.006536725 0.006358450 0.006180176 0.006061326
## my not
## 0.005942477 0.005823627
highFreqTokenTable(news, m = 1000, n = 20)
## the and a to of in
## 0.056818182 0.031524927 0.030791789 0.028592375 0.022727273 0.021627566
## for that on is with be
## 0.012463343 0.011730205 0.010263930 0.008797654 0.008797654 0.007697947
## said it was from he have
## 0.007697947 0.006964809 0.006964809 0.006231672 0.005131965 0.004765396
## but not
## 0.004765396 0.004032258
From the results, we can see that the word “I” is not very frequent in the news dataset, but very frequent in twitter and blogs. The probability of “I” in news is 0.0033149, which is much lower than 0.0200847 and 0.0232843. This reflects the fact that news writing is expected to be objective.
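Such probabilities can be reproduced, up to sampling noise, with the freqToken function above; a sketch (the values will differ from run to run):
## relative frequency of the token "i" in a random sample of each corpus
freqToken("i", news, m = 1000)
freqToken("i", blogs, m = 2000)
freqToken("i", twitter, m = 5000)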
We could show more differences between the three datasets with these two functions, reflecting features of twitter, blogs and news, but let us turn our focus to prediction models.
Given a corpus such as twitter and a token like “the”, what is the frequency table of the tokens that appear right after the specified token “the” in that corpus? We are going to build a function to get this frequency table.
tokenProducedFreqTable <- function(token, corpus){
  li = tolower(subsetData(corpus))
  if (sum(grepl(token, li))) {
    ## the token occurs in the sample: work on the sampled lines
    x = grep(token, li)
    subLi = subList(x, li)
    subLi = tokenize(getWords(subLi))
    newLi = subLi
    for (i in 1:length(x)) {
      ## keep the word that follows each occurrence of the token
      newLi[[i]] <- subLi[[i]][grep(token, subLi[[i]]) + 1]
    }
  }
  else {
    ## rare token: fall back to searching the whole corpus
    x = grep(token, corpus)
    subLi = tolower(subList(x, corpus))
    subLi = tokenize(getWords(subLi))
    newLi = subLi
    for (i in 1:length(x)) {
      newLi[[i]] <- subLi[[i]][grep(token, subLi[[i]]) + 1]
    }
  }
  ## frequency table of the six most common following tokens
  freqTable(newLi)[1:6]
}
Let’s see what we can do with this function.
If we would like to predict the word after “you” with the twitter dataset, we just print
tokenProducedFreqTable("you", twitter)
## have are can know for dont
## 219 216 171 135 112 101
to get the predictions “have”, “are”, “can”, “know”, “for” in decreasing order of frequency.
But if we predict the word after “you” with a different training set such as news, then
tokenProducedFreqTable("you", news)
## can have will would need that
## 5 4 3 3 3 2
the prediction would be different.
The prediction model above only uses the word immediately before the target. The next model adds information from the word two positions before the target: given a token, it collects the tokens that appear two places after it.
tokenProducedFreqTable2 <- function(token, corpus){
  li = tolower(subsetData(corpus))
  if (sum(grepl(token, li))) {
    x = grep(token, li)
    subLi = subList(x, li)
    subLi = tokenize(getWords(subLi))
    newLi = subLi
    for (i in 1:length(x)) {
      ## keep the word two positions after each occurrence of the token
      newLi[[i]] <- subLi[[i]][grep(token, subLi[[i]]) + 2]
    }
  }
  else {
    x = grep(token, corpus)
    subLi = tolower(subList(x, corpus))
    subLi = tokenize(getWords(subLi))
    newLi = subLi
    for (i in 1:length(x)) {
      newLi[[i]] <- subLi[[i]][grep(token, subLi[[i]]) + 2]
    }
  }
  freqTable(newLi)[1:6]
}
Our main difficulty is that the corpus is too big, so we need a lot of time and memory. I could only deal with a random sample in certain situations. We could save the tokens in a data frame whose first column is the token and whose second column is the frequency of the token.
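A sketch of that idea, using the freqTable output defined above (the file name is just an example of mine):
## store a frequency table as a two-column data frame (token, frequency)
## so it can be reused later without re-tokenizing the corpus
ft <- freqTable(twittertokens)
freqDf <- data.frame(token = names(ft), frequency = as.integer(ft),
                     stringsAsFactors = FALSE)
write.csv(freqDf, "twitter_token_freq.csv", row.names = FALSE)
## later: freqDf <- read.csv("twitter_token_freq.csv", stringsAsFactors = FALSE)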
Our algorithm to predict is:
twoGramsPredict <- function(token, corpus){
  data <- tolower(subsetData(corpus))
  if (sum(grepl(token, data)) > 20) {
    ## enough matches in the sample: use the sampled lines
    x = grep(token, data)
    subLi = subList(x, data)
    subLi = tokenize(getWords(subLi))
    newLi = subLi
    for (i in 1:length(x)) {
      newLi[[i]] <- subLi[[i]][grep(token, subLi[[i]]) + 1]
    }
    freqTable(newLi)[1]
  }
  else {
    ## too few matches: fall back to the whole corpus
    x = grep(token, corpus)
    subLi = tolower(subList(x, corpus))
    subLi = tokenize(getWords(subLi))
    newLi = subLi
    for (i in 1:length(x)) {
      newLi[[i]] <- subLi[[i]][grep(token, subLi[[i]]) + 1]
    }
    freqTable(newLi)[1]
  }
}
twoGramsPredict("the", news)
## same
## 21
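The returned value is a named count: the predicted word is its name, which can be extracted as follows (a usage note of mine; the word may differ between runs because the sample is random):
pred <- twoGramsPredict("the", news)
names(pred)  ## "same" in the run above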
threeGramsPredict <- function(token2, token1, corpus){
  data <- tolower(subsetData(corpus))
  ## a: relative frequencies of the words following token1 alone (2-gram evidence)
  if (sum(grepl(token1, data)) > 20) {
    x = grep(token1, data)
    subLi = subList(x, data)
    subLi = tokenize(getWords(subLi))
    newLi1 = subLi
    for (i in 1:length(x)) {
      newLi1[[i]] <- subLi[[i]][grep(token1, subLi[[i]]) + 1]
    }
    a = freqTable(newLi1)/sum(freqTable(newLi1))
  }
  else {
    x = grep(token1, corpus)
    subLi = tolower(subList(x, corpus))
    subLi = tokenize(getWords(subLi))
    newLi1 = subLi
    for (i in 1:length(x)) {
      newLi1[[i]] <- subLi[[i]][grep(token1, subLi[[i]]) + 1]
    }
    a = freqTable(newLi1)/sum(freqTable(newLi1))
  }
  ## b: relative frequencies of the words following the pair "token2 token1" (3-gram evidence)
  if (sum(grepl(paste(token2, token1), data)) > 20) {
    x = grep(paste(token2, token1), data)
    subLi = subList(x, data)
    subLi = tokenize(getWords(subLi))
    newLi2 = subLi
    for (i in 1:length(x)) {
      newLi2[[i]] <- subLi[[i]][grep(token1, subLi[[i]]) + 1]
    }
    b = freqTable(newLi2)/sum(freqTable(newLi2))
  }
  else {
    x = grep(paste(token2, token1), corpus)
    subLi = tolower(subList(x, corpus))
    subLi = tokenize(getWords(subLi))
    newLi2 = subLi
    for (i in 1:length(x)) {
      newLi2[[i]] <- subLi[[i]][grep(token1, subLi[[i]]) + 1]
    }
    b = freqTable(newLi2)/sum(freqTable(newLi2))
  }
  ## return the more probable of the two top candidates (the name is the predicted word)
  if (a[1] > b[1]) {a[1]}
  else {b[1]}
}
threeGramsPredict("in", "the", news)
## first
## 0.02291667