This report illustrates some standard Natural Language Processing (NLP) tasks applied to English text corpora derived from web scraping of Twitter, news feeds and blogs. This includes deriving basic data summaries, tokenisation, filtering, and the construction and ranking of n-grams.
We read the three corpora into R using the base readLines function (the tm package is loaded here for use in later processing):
## Read the three corpora into R using readLines
suppressMessages(library(tm))
con <- file("en_US.twitter.txt", "r")
t <- readLines(con, encoding="UTF-8")
close(con)
con <- file("en_US.news.txt", "r")
n <- readLines(con, encoding="UTF-8")
close(con)
con <- file("en_US.blogs.txt", "r")
b <- readLines(con, encoding="UTF-8")
close(con)
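Note that readLines can stop reading early or emit warnings if it encounters embedded nul or other control characters, an issue sometimes reported for the news file in this dataset. If that occurs, opening the connection in binary mode and setting skipNul = TRUE is one possible workaround (a sketch, not used for the figures reported below):
## Possible workaround if readLines stops early on embedded nul characters
con <- file("en_US.news.txt", open = "rb")
n <- readLines(con, encoding="UTF-8", skipNul=TRUE)
close(con)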
First we review the file size in MB of each of the three corpora:
## Obtain the size in MB of each corpus
file.info("en_US.twitter.txt")$size/1024^2
## [1] 159.3641
file.info("en_US.news.txt")$size/1024^2
## [1] 196.2775
file.info("en_US.blogs.txt")$size/1024^2
## [1] 200.4242
We measure the number of lines in each corpus:
## Obtain the number of lines in each corpus
twitter_lines <- length(t); twitter_lines
## [1] 2360148
news_lines <- length(n); news_lines
## [1] 77259
blogs_lines <- length(b); blogs_lines
## [1] 899288
We count the number of words in each corpus using the stringi package:
## Obtain the number of words in each corpus using the stringi package
suppressMessages(library(stringi))
twitter_words <- sum(stri_count_words(t)); twitter_words
## [1] 30093372
news_words <- sum(stri_count_words(n)); news_words
## [1] 2674536
blogs_words <- sum(stri_count_words(b)); blogs_words
## [1] 37546239
We obtain the number of characters in each corpus:
## Obtain the number of characters in each corpus (including whitespace)
twitter_char <- nchar(t); twitter_chars <- sum(twitter_char); twitter_chars
## [1] 162096031
news_char <- nchar(n); news_chars <- sum(news_char); news_chars
## [1] 15639408
blogs_char <- nchar(b); blogs_chars <- sum(blogs_char); blogs_chars
## [1] 206824505
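For readability, these basic summaries can be gathered into a single data frame (a sketch using the objects created above; corpus_summary is an illustrative name):
## Combine the basic corpus summaries into one table
corpus_summary <- data.frame(
    corpus  = c("twitter", "news", "blogs"),
    size_mb = c(file.info("en_US.twitter.txt")$size,
                file.info("en_US.news.txt")$size,
                file.info("en_US.blogs.txt")$size) / 1024^2,
    lines   = c(twitter_lines, news_lines, blogs_lines),
    words   = c(twitter_words, news_words, blogs_words),
    chars   = c(twitter_chars, news_chars, blogs_chars)
)
corpus_summary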
In order to undertake further exploratory analysis via tokenisation and creation of n-grams, we take a subset comprising the first 5,000 lines of each corpus. This speeds up the subsequent processing and ensures that we operate within the memory constraints of our working environment:
## Subset to 5,000 lines of each dataset for exploratory analysis
con <- file("en_US.twitter.txt", "r")
t1 <- readLines(con, 5000, encoding="UTF-8")
close(con)
con <- file("en_US.news.txt", "r")
n1 <- readLines(con, 5000, encoding="UTF-8")
close(con)
con <- file("en_US.blogs.txt", "r")
b1 <- readLines(con, 5000, encoding="UTF-8")
close(con)
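Since readLines(con, 5000) returns the first 5,000 lines of each file, an arguably more representative alternative would be to draw a random sample from the corpora already held in memory (a sketch; the seed and sample size are arbitrary choices):
## Alternative: take a random sample of 5,000 lines rather than the first 5,000
set.seed(1234)   ## arbitrary seed for reproducibility
t1 <- sample(t, 5000)
n1 <- sample(n, 5000)
b1 <- sample(b, 5000)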
To facilitate profanity filtering in our tokenisation and n-gram creation, a profanity listing in English is sourced from GitHub and read into R using the readLines function:
## Load in a data set of profanity to support filtering of profanity from datasets for further processing
## This profanity listing can be found at:
## https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en
con <- file("badwords.txt", "r")
bad <- readLines(con, encoding="UTF-8")
close(con)
Now we filter the subsetted corpora to facilitate tokenisation. In this first instance we retain stopwords in the corpora:
## Filter sample corpora by removing whitespace, numbers, punctuation and profanity
## Convert sample corpora to lowercase
vt1 <- VCorpus(VectorSource(t1))
vt1 <- tm_map(vt1, stripWhitespace)
vt1 <- tm_map(vt1, content_transformer(tolower))
vt1 <- tm_map(vt1, removeWords, bad)
vt1 <- tm_map(vt1, removePunctuation)
vt1 <- tm_map(vt1, removeNumbers)
vn1 <- VCorpus(VectorSource(n1))
vn1 <- tm_map(vn1, stripWhitespace)
vn1 <- tm_map(vn1, content_transformer(tolower))
vn1 <- tm_map(vn1, removeWords, bad)
vn1 <- tm_map(vn1, removePunctuation)
vn1 <- tm_map(vn1, removeNumbers)
vb1 <- VCorpus(VectorSource(b1))
vb1 <- tm_map(vb1, stripWhitespace)
vb1 <- tm_map(vb1, content_transformer(tolower))
vb1 <- tm_map(vb1, removeWords, bad)
vb1 <- tm_map(vb1, removePunctuation)
vb1 <- tm_map(vb1, removeNumbers)
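Since the same cleaning pipeline is applied to each corpus here, and again further below, a small helper function could reduce the repetition (a sketch equivalent to the steps above; clean_corpus and its drop_stopwords argument are illustrative names):
## Helper applying the same cleaning steps to any character vector
clean_corpus <- function(x, badwords, drop_stopwords = FALSE) {
    v <- VCorpus(VectorSource(x))
    v <- tm_map(v, stripWhitespace)
    v <- tm_map(v, content_transformer(tolower))
    v <- tm_map(v, removeWords, badwords)
    if (drop_stopwords) v <- tm_map(v, removeWords, stopwords("english"))
    v <- tm_map(v, removePunctuation)
    v <- tm_map(v, removeNumbers)
    v
}
## e.g. vt1 <- clean_corpus(t1, bad); vt2 <- clean_corpus(t1, bad, drop_stopwords = TRUE)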
Using the RWeka package, we create tokeniser functions to produce n-grams of 1, 2 and 3 terms in length. Then we create a function that takes a filtered subsetted corpus and one of the n-gram tokenisers, and outputs a frequency table derived from a term-document matrix, ordered by descending n-gram frequency:
## Create n-gram functions for 1, 2 and 3 terms
## param: corpus of natural language text
suppressMessages(library(RWeka))
nGram1 <- function(c){NGramTokenizer(c, Weka_control(min=1,max=1))}
nGram2 <- function(c){NGramTokenizer(c, Weka_control(min=2,max=2))}
nGram3 <- function(c){NGramTokenizer(c, Weka_control(min=3,max=3))}
## Create a function that returns a frequency table of n-grams, ordered by descending frequency
## param: corpus of natural language text
## param: n-gram function
nlist <- function(c, n) {
    tdm <- TermDocumentMatrix(c, control = list(tokenize = n))
    freq <- rowSums(as.matrix(tdm))
    nt <- data.frame(n = names(freq), num = freq)
    ## return the table ordered by descending n-gram frequency
    nt[order(-nt$num), ]
}
We apply our functions to the filtered corpora. The ten most frequent n-grams of each length are listed for each corpus; note that these include stopwords:
## Produce n-gram frequency tables for the filtered corpora and show the top ten terms in each
t11 <- nlist(vt1, nGram1); head(t11,10)
## n num
## the the 1959
## you you 1136
## and and 911
## for for 801
## that that 531
## with with 368
## this this 358
## are are 336
## your your 335
## have have 333
t12 <- nlist(vt1, nGram2); head(t12,10)
## n num
## in the in the 163
## for the for the 150
## of the of the 128
## to be to be 119
## on the on the 99
## thanks for thanks for 99
## to the to the 89
## i love i love 85
## have a have a 80
## going to going to 78
t13 <- nlist(vt1, nGram3); head(t13,10)
## n num
## thanks for the thanks for the 50
## i love you i love you 23
## going to be going to be 21
## i want to i want to 18
## thank you for thank you for 18
## cant wait to cant wait to 17
## i have a i have a 16
## a lot of a lot of 15
## for the follow for the follow 14
## to see you to see you 14
n11 <- nlist(vn1, nGram1); head(n11,10)
## n num
## the the 9620
## and and 4346
## for for 1844
## that that 1649
## with with 1271
## said said 1241
## was was 1093
## but but 778
## from from 772
## his his 758
n12 <- nlist(vn1, nGram2); head(n12,10)
## n num
## in the in the 884
## of the of the 877
## to the to the 383
## on the on the 364
## for the for the 353
## at the at the 269
## and the and the 259
## in a in a 255
## to be to be 244
## with the with the 192
n13 <- nlist(vn1, nGram3); head(n13,10)
## n num
## one of the one of the 67
## a lot of a lot of 52
## going to be going to be 31
## according to the according to the 30
## as well as as well as 30
## be able to be able to 28
## part of the part of the 27
## the end of the end of 26
## to be a to be a 25
## it was a it was a 23
b11 <- nlist(vb1, nGram1); head(b11,10)
## n num
## the the 10207
## and and 5997
## that that 2568
## for for 1941
## you you 1623
## with with 1620
## was was 1510
## this this 1414
## have have 1195
## but but 1147
b12 <- nlist(vb1, nGram2); head(b12,10)
## n num
## of the of the 1000
## in the in the 846
## to the to the 501
## to be to be 385
## on the on the 383
## for the for the 332
## and the and the 325
## and i and i 286
## it was it was 265
## with the with the 265
b13 <- nlist(vb1, nGram3); head(b13,10)
## n num
## a lot of a lot of 82
## one of the one of the 76
## some of the some of the 39
## to be a to be a 39
## i want to i want to 36
## it was a it was a 36
## i have to i have to 33
## the end of the end of 32
## a couple of a couple of 31
## out of the out of the 31
Now we produce new filtered versions of the subsetted corpora. In this case we remove stopwords alongside the other filters that were used for the first filtering exercise:
## Filter sample corpora by removing whitespace, numbers, punctuation, stopwords and profanity
## Convert sample corpora to lowercase
vt2 <- VCorpus(VectorSource(t1))
vt2 <- tm_map(vt2, stripWhitespace)
vt2 <- tm_map(vt2, content_transformer(tolower))
vt2 <- tm_map(vt2, removeWords, bad)
vt2 <- tm_map(vt2, removeWords, stopwords("english"))
vt2 <- tm_map(vt2, removePunctuation)
vt2 <- tm_map(vt2, removeNumbers)
vn2 <- VCorpus(VectorSource(n1))
vn2 <- tm_map(vn2, stripWhitespace)
vn2 <- tm_map(vn2, content_transformer(tolower))
vn2 <- tm_map(vn2, removeWords, bad)
vn2 <- tm_map(vn2, removeWords, stopwords("english"))
vn2 <- tm_map(vn2, removePunctuation)
vn2 <- tm_map(vn2, removeNumbers)
vb2 <- VCorpus(VectorSource(b1))
vb2 <- tm_map(vb2, stripWhitespace)
vb2 <- tm_map(vb2, content_transformer(tolower))
vb2 <- tm_map(vb2, removeWords, bad)
vb2 <- tm_map(vb2, removeWords, stopwords("english"))
vb2 <- tm_map(vb2, removePunctuation)
vb2 <- tm_map(vb2, removeNumbers)
We apply our functions to the second batch of filtered corpora. We also subset the resulting frequency tables to the ten most frequent n-grams in each, to facilitate plotting:
## Produce n-gram frequency tables for the second batch of filtered corpora and
## subset each table to its ten most frequent n-grams
t21 <- nlist(vt2, nGram1)
t22 <- nlist(vt2, nGram2)
t23 <- nlist(vt2, nGram3)
n21 <- nlist(vn2, nGram1)
n22 <- nlist(vn2, nGram2)
n23 <- nlist(vn2, nGram3)
b21 <- nlist(vb2, nGram1)
b22 <- nlist(vb2, nGram2)
b23 <- nlist(vb2, nGram3)
t21a <- t21[1:10,]; t22a <- t22[1:10,]; t23a <- t23[1:10,]
n21a <- n21[1:10,]; n22a <- n22[1:10,]; n23a <- n23[1:10,]
b21a <- b21[1:10,]; b22a <- b22[1:10,]; b23a <- b23[1:10,]
Now we plot the ten most frequent n-grams in each of the filtered corpora, using the ggplot2 and gridExtra packages.
First we show the 1-, 2- and 3-grams for the Twitter corpus:
suppressMessages(library(ggplot2)); suppressMessages(library(gridExtra))
pt1 <- ggplot(t21a, aes(num,n)) + geom_col()
pt2 <- ggplot(t22a, aes(num,n)) + geom_col()
pt3 <- ggplot(t23a, aes(num,n)) + geom_col()
grid.arrange(pt1,pt2,pt3,nrow=1)
Next we show the 1-, 2- and 3-grams for the news corpus:
pn1 <- ggplot(n21a, aes(num,n)) + geom_col()
pn2 <- ggplot(n22a, aes(num,n)) + geom_col()
pn3 <- ggplot(n23a, aes(num,n)) + geom_col()
grid.arrange(pn1,pn2,pn3,nrow=1)
Finally we show the 1-, 2- and 3-grams for the blogs corpus:
pb1 <- ggplot(b21a, aes(num,n)) + geom_col()
pb2 <- ggplot(b22a, aes(num,n)) + geom_col()
pb3 <- ggplot(b23a, aes(num,n)) + geom_col()
grid.arrange(pb1,pb2,pb3,nrow=1)
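One possible refinement, not applied above, would be to order the bars by frequency and label the axes; for example, with reorder():
## Hypothetical refinement: bars ordered by frequency, with axis labels
pt1_ordered <- ggplot(t21a, aes(x = num, y = reorder(n, num))) +
    geom_col() +
    labs(x = "Frequency", y = "1-gram", title = "Top ten 1-grams, Twitter sample")
pt1_ordered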
There are clear distinctions in the n-grams obtained from the Twitter, news and blogs corpora.
The Twitter corpus is typically less formal and more conversational, with shorter and contracted words used more often than in the other corpora. Emotive n-grams such as "love" also feature more prominently.
The news corpus shows a reasonably strong emphasis on places, prominent people and organisations, and what people have been recorded as saying.
The blogs corpus appears to take on the role of personal narrative; its n-grams are more formal and slightly more sophisticated than those of the Twitter corpus, but still more personal than those of the news corpus.
These distinctions become clearer by removing stopwords, although a careful enough study of the n-grams including stopwords can still give similar impressions.
Some oddities show up in the n-grams (e.g. repetition of the single character "u" in the news n-grams), indicating that further work on tokenisation, filtering and preparing the corpora for n-gram construction is warranted.
I am considering development of a prediction algorithm using the markovchain package in R, as there are indications that this is relatively simple and light on computational power compared to some other methods.
Stupid back-off seems a realistic method for handling prediction when a phrase has not been encountered in the corpus used to build the model. If this proves too difficult or not particularly effective, some form of probabilistic substitution could be attempted instead.
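To illustrate the back-off idea using the frequency tables already built, a naive next-word lookup might look like the following sketch (illustrative only: predict_next is a hypothetical helper, no back-off weighting is applied, and the Twitter tables t13, t12 and t11 are assumed to be available):
## Naive back-off sketch over the n-gram frequency tables (hypothetical helper)
## Tries trigrams first, then bigrams, then falls back to the top unigram
predict_next <- function(phrase, tri = t13, bi = t12, uni = t11) {
    words <- tolower(unlist(strsplit(phrase, "\\s+")))
    k <- length(words)
    if (k >= 2) {
        ## look for trigrams beginning with the last two words of the phrase
        hits <- tri[grepl(paste0("^", words[k-1], " ", words[k], " "), tri$n), ]
        if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$n[1])))
    }
    if (k >= 1) {
        ## back off to bigrams beginning with the last word
        hits <- bi[grepl(paste0("^", words[k], " "), bi$n), ]
        if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$n[1])))
    }
    ## final fallback: the single most frequent unigram
    as.character(uni$n[1])
}
## e.g. predict_next("thanks for") should return "the" for the Twitter sample above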
There are indications that the quanteda package may work more efficiently than the tm package for NLP tasks, so I may look into this when developing the prediction model.
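For reference, an equivalent bigram count with quanteda might look roughly like the following (an untested sketch against the Twitter sample t1 and the profanity list bad):
## Rough quanteda equivalent of the bigram counts above (untested sketch)
suppressMessages(library(quanteda))
toks <- tokens(t1, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, bad)
bigrams <- tokens_ngrams(toks, n = 2, concatenator = " ")
topfeatures(dfm(bigrams), 10)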