The motivation for this report is to summarize the exploratory analysis of the three text files (twitter, news, blogs) and to outline how the cleaned data will feed the n-gram prediction model.
With the help of the wc Linux command (https://en.wikipedia.org/wiki/Wc_(Unix)) we find the following:
filesSummary
##      file linescount wordscount charactercount
## 1 twitter    2360148   30359852      166843164
## 2    news    1009907   34365936      205243643
## 3   blogs     898584   37334114      208623085
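A minimal sketch of how such a summary can be produced from R by calling wc; the raw file names en_US.twitter.txt, en_US.news.txt and en_US.blogs.txt are assumptions, not taken from the report:

# Count lines, words and characters with wc and collect the results
# (the file names below are assumed)
files <- c(twitter = "en_US.twitter.txt",
           news    = "en_US.news.txt",
           blogs   = "en_US.blogs.txt")
filesSummary <- do.call(rbind, lapply(names(files), function(f) {
  out <- system(paste("wc -l -w -m", files[f]), intern = TRUE)
  counts <- as.numeric(strsplit(trimws(out), "\\s+")[[1]][1:3])
  data.frame(file = f, linescount = counts[1],
             wordscount = counts[2], charactercount = counts[3])
}))
filesSummary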
However, if we open the files we will see that there are many issues we have to face when creating n-grams to build our model. For these reasons we have to clean and tidy our data.
With the help of the Unix tools:
* tr
* awk
* sed
the author cleaned and normalised the raw text files (a sketch of this kind of cleaning follows below).
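A minimal sketch of the kind of cleaning tr can perform, lower-casing the text and dropping characters other than letters, apostrophes and whitespace; the input and output file names are hypothetical and the author's exact commands may differ:

# Lower-case and strip unwanted characters with tr, called from R
# (en_US.twitter.txt and twitter.clean.txt are assumed file names)
system(paste(
  "tr '[:upper:]' '[:lower:]' < en_US.twitter.txt",
  "| tr -cd \"[:alpha:]'[:space:]\"",
  "> twitter.clean.txt"
))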
Having cleaned the files, the author then created, for each data file, a new file containing unique word frequencies. E.g. for twitter the first two lines are:
936331, the
787437, to
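A minimal sketch of how such a "count, word" file can be produced with standard Unix tools called from R; twitter.clean.txt is an assumed name for the cleaned input, while twitter.unicount.txt is the output read below:

# One word per line, count unique words, sort by descending frequency,
# then write lines of the form "count, word"
system(paste(
  "tr -s '[:space:]' '\\n' < twitter.clean.txt",
  "| sort | uniq -c | sort -rn",
  "| awk '{print $1\", \"$2}' > twitter.unicount.txt"
))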
library(data.table)
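# Read each "count, word" file produced above and give the columns descriptive names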
twitWords <- fread("twitter.unicount.txt", sep = ",", header = FALSE, colClasses = c("numeric", "character"), stringsAsFactors = FALSE, na.strings = "NA")
setnames(twitWords, names(twitWords), c("freq", "word"))
newsWords <- fread("news.unicount.txt", sep = ",", header = FALSE, colClasses = c("numeric", "character"), stringsAsFactors = FALSE, na.strings = "NA")
setnames(newsWords, names(newsWords), c("freq", "word"))
blogsWords <- fread("blogs.unicount.txt", sep = ",", header = FALSE, colClasses = c("numeric", "character"), stringsAsFactors = FALSE, na.strings = "NA")
setnames(blogsWords, names(blogsWords), c("freq", "word"))
We move on to explore our tables:
head(twitWords,10)
## freq word
## 1: 936331 the
## 2: 787437 to
## 3: 726051 i
## 4: 614545 a
## 5: 548560 you
## 6: 438176 and
## 7: 385140 for
## 8: 378136 in
## 9: 359232 of
## 10: 358756 is
head(newsWords,10)
## freq word
## 1: 1971977 the
## 2: 901179 to
## 3: 891338 a
## 4: 885278 and
## 5: 771093 of
## 6: 674213 in
## 7: 351244 for
## 8: 350523 that
## 9: 284130 is
## 10: 266868 on
head(blogsWords,10)
## freq word
## 1: 1857944 the
## 2: 1092742 and
## 3: 1066296 to
## 4: 901848 a
## 5: 875332 of
## 6: 832173 i
## 7: 594343 in
## 8: 472671 that
## 9: 442779 it
## 10: 432459 is
We note the close similarity among the top-ranked words in each file.
We move on to see the distribution of our words:
summary(twitWords)
## freq word
## Min. : 1.0 Length:388043
## 1st Qu.: 1.0 Class :character
## Median : 1.0 Mode :character
## Mean : 76.8
## 3rd Qu.: 3.0
## Max. :936331.0
summary(newsWords)
## freq word
## Min. : 1.0 Length:319395
## 1st Qu.: 1.0 Class :character
## Median : 2.0 Mode :character
## Mean : 106.1
## 3rd Qu.: 6.0
## Max. :1971977.0
summary(blogsWords)
## freq word
## Min. : 1.0 Length:361688
## 1st Qu.: 1.0 Class :character
## Median : 1.0 Mode :character
## Mean : 103.4
## 3rd Qu.: 4.0
## Max. :1857944.0
We note that, as expected, the text is dominated by a small fraction of the unique words: the median frequency is 1 or 2, while the most frequent words appear hundreds of thousands of times.
We can also see from the following histogram that the frequency distribution is highly skewed:
hist(twitWords[,freq])
The same result is visible with a simple plot:
plot(twitWords[,freq])
Even if we restrict ourselves to the top 100 words, the skewness remains high:
plot(twitWords[1:100,freq])
The author created n-grams for each text file.
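A minimal sketch of how bigram counts could be built in R with data.table, assuming the hypothetical cleaned file twitter.clean.txt used above; the author's actual n-gram pipeline may differ:

library(data.table)
# Build bigram (adjacent word pair) counts from the cleaned text
# (twitter.clean.txt is an assumed name for a cleaned input file)
cleanLines <- tolower(readLines("twitter.clean.txt"))
wordLists  <- strsplit(trimws(cleanLines), "\\s+")   # words of each line
bigrams <- unlist(lapply(wordLists, function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))                    # adjacent word pairs
}))
bigramCounts <- data.table(ngram = bigrams)[, .N, by = ngram][order(-N)]
head(bigramCounts)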
Again the distribution is very skewed. We should take this into account when building our model and keep only the frequent n-grams, in which only the most frequent words appear.
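Such pruning can be as simple as a frequency filter on the counts table, e.g. with the hypothetical bigramCounts from the sketch above:

bigramCounts[N >= 5]   # keep only bigrams seen at least 5 times (threshold chosen for illustration)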