This report illustrates some standard Natural Language Processing (NLP) tasks applied to English text corpora derived from web scraping of Twitter, news feeds and blogs. This includes deriving basic data summaries, tokenisation, filtering, and the construction and ranking of n-grams.
We read the three corpora into R using the base readLines function (the tm package is loaded here for use in later processing):
## Read the three corpora into R using readLines
suppressMessages(library(tm))
con <- file("en_US.twitter.txt", "r")
t <- readLines(con, encoding="UTF-8")
close(con)
con <- file("en_US.news.txt", "r")
n <- readLines(con, encoding="UTF-8")
close(con)
con <- file("en_US.blogs.txt", "r")
b <- readLines(con, encoding="UTF-8")
close(con)
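Note that readLines can stop reading early or emit warnings if it encounters embedded nul or other control characters, an issue sometimes reported for the news file in this dataset. If that occurs, opening the connection in binary mode and setting skipNul = TRUE is one possible workaround (a sketch, not used for the figures reported below):
## Possible workaround if readLines stops early on embedded nul characters
con <- file("en_US.news.txt", open = "rb")
n <- readLines(con, encoding="UTF-8", skipNul=TRUE)
close(con)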
First we review the file size in MB of each of the three corpora:
## Obtain the size in MB of each corpus
file.info("en_US.twitter.txt")$size/1024^2
## [1] 159.3641
file.info("en_US.news.txt")$size/1024^2
## [1] 196.2775
file.info("en_US.blogs.txt")$size/1024^2
## [1] 200.4242
We measure the number of lines in each corpus:
## Obtain the number of lines in each corpus
twitter_lines <- length(t); twitter_lines
## [1] 2360148
news_lines <- length(n); news_lines
## [1] 77259
blogs_lines <- length(b); blogs_lines
## [1] 899288
We count the number of words in each corpus using the stringi package:
## Obtain the number of words in each corpus using the stringi package
suppressMessages(library(stringi))
twitter_words <- sum(stri_count_words(t)); twitter_words
## [1] 30093372
news_words <- sum(stri_count_words(n)); news_words
## [1] 2674536
blogs_words <- sum(stri_count_words(b)); blogs_words
## [1] 37546239
We obtain the number of characters in each corpus:
## Obtain the number of characters in each corpus (including whitespace)
twitter_char <- nchar(t); twitter_chars <- sum(twitter_char); twitter_chars
## [1] 162096031
news_char <- nchar(n); news_chars <- sum(news_char); news_chars
## [1] 15639408
blogs_char <- nchar(b); blogs_chars <- sum(blogs_char); blogs_chars
## [1] 206824505
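For readability, these basic summaries can be gathered into a single data frame (a sketch using the objects created above; corpus_summary is an illustrative name):
## Combine the basic corpus summaries into one table
corpus_summary <- data.frame(
    corpus  = c("twitter", "news", "blogs"),
    size_mb = c(file.info("en_US.twitter.txt")$size,
                file.info("en_US.news.txt")$size,
                file.info("en_US.blogs.txt")$size) / 1024^2,
    lines   = c(twitter_lines, news_lines, blogs_lines),
    words   = c(twitter_words, news_words, blogs_words),
    chars   = c(twitter_chars, news_chars, blogs_chars)
)
corpus_summary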
In order to undertake further exploratory analysis via tokenisation and creation of n-grams, we take a subset comprising the first 5,000 lines of each corpus. This speeds up the subsequent processing and ensures that we operate within the memory constraints of our working environment:
## Subset to 5,000 lines of each dataset for exploratory analysis
con <- file("en_US.twitter.txt", "r")
t1 <- readLines(con, 5000, encoding="UTF-8")
close(con)
con <- file("en_US.news.txt", "r")
n1 <- readLines(con, 5000, encoding="UTF-8")
close(con)
con <- file("en_US.blogs.txt", "r")
b1 <- readLines(con, 5000, encoding="UTF-8")
close(con)
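Since readLines(con, 5000) returns the first 5,000 lines of each file, an arguably more representative alternative would be to draw a random sample from the corpora already held in memory (a sketch; the seed and sample size are arbitrary choices):
## Alternative: take a random sample of 5,000 lines rather than the first 5,000
set.seed(1234)   ## arbitrary seed for reproducibility
t1 <- sample(t, 5000)
n1 <- sample(n, 5000)
b1 <- sample(b, 5000)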
To facilitate profanity filtering in our tokenisation and n-gram creation, a profanity listing in English is sourced from GitHub and read into R using the readLines function:
## Load in a data set of profanity to support filtering of profanity from datasets for further processing
## This profanity listing can be found at:
## https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en
con <- file("badwords.txt", "r")
bad <- readLines(con, encoding="UTF-8")
close(con)
Now we filter the subsetted corpora to facilitate tokenisation. In this first instance we retain stopwords in the corpora:
## Filter sample corpora by removing whitespace, numbers, punctuation and profanity
## Convert sample corpora to lowercase
vt1 <- VCorpus(VectorSource(t1))
vt1 <- tm_map(vt1, stripWhitespace)
vt1 <- tm_map(vt1, content_transformer(tolower))
vt1 <- tm_map(vt1, removeWords, bad)
vt1 <- tm_map(vt1, removePunctuation)
vt1 <- tm_map(vt1, removeNumbers)
vn1 <- VCorpus(VectorSource(n1))
vn1 <- tm_map(vn1, stripWhitespace)
vn1 <- tm_map(vn1, content_transformer(tolower))
vn1 <- tm_map(vn1, removeWords, bad)
vn1 <- tm_map(vn1, removePunctuation)
vn1 <- tm_map(vn1, removeNumbers)
vb1 <- VCorpus(VectorSource(b1))
vb1 <- tm_map(vb1, stripWhitespace)
vb1 <- tm_map(vb1, content_transformer(tolower))
vb1 <- tm_map(vb1, removeWords, bad)
vb1 <- tm_map(vb1, removePunctuation)
vb1 <- tm_map(vb1, removeNumbers)
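Since the same cleaning pipeline is applied to each corpus here, and again further below, a small helper function could reduce the repetition (a sketch equivalent to the steps above; clean_corpus and its drop_stopwords argument are illustrative names):
## Helper applying the same cleaning steps to any character vector
clean_corpus <- function(x, badwords, drop_stopwords = FALSE) {
    v <- VCorpus(VectorSource(x))
    v <- tm_map(v, stripWhitespace)
    v <- tm_map(v, content_transformer(tolower))
    v <- tm_map(v, removeWords, badwords)
    if (drop_stopwords) v <- tm_map(v, removeWords, stopwords("english"))
    v <- tm_map(v, removePunctuation)
    v <- tm_map(v, removeNumbers)
    v
}
## e.g. vt1 <- clean_corpus(t1, bad); vt2 <- clean_corpus(t1, bad, drop_stopwords = TRUE)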
Using the RWeka package, we create tokeniser functions to produce n-grams of 1, 2 and 3 terms in length. Then we create a function that takes a filtered subsetted corpus and one of the n-gram tokenisers, and outputs a frequency table derived from a term-document matrix, ordered by descending n-gram frequency:
## Create n-gram functions for 1, 2 and 3 terms
## param: corpus of natural language text
suppressMessages(library(RWeka))
nGram1 <- function(c){NGramTokenizer(c, Weka_control(min=1,max=1))}
nGram2 <- function(c){NGramTokenizer(c, Weka_control(min=2,max=2))}
nGram3 <- function(c){NGramTokenizer(c, Weka_control(min=3,max=3))}
## Create a function that returns a frequency table of n-grams, ordered by descending frequency
## param: corpus of natural language text
## param: n-gram function
nlist <- function(c, n) {
    tdm <- TermDocumentMatrix(c, control = list(tokenize = n))
    freq <- rowSums(as.matrix(tdm))
    nt <- data.frame(n = names(freq), num = freq)
    ## return the table ordered by descending n-gram frequency
    nt[order(-nt$num), ]
}
We apply our functions to the filtered corpora. The ten most frequent n-grams of each length are listed for each corpus; note that these include stopwords:
## Produce n-gram frequency tables for the filtered corpora and show the top ten terms in each
t11 <- nlist(vt1, nGram1); head(t11,10)
## n num
## the the 1959
## you you 1136
## and and 911
## for for 801
## that that 531
## with with 368
## this this 358
## are are 336
## your your 335
## have have 333
t12 <- nlist(vt1, nGram2); head(t12,10)
## n num
## in the in the 163
## for the for the 150
## of the of the 128
## to be to be 119
## on the on the 99
## thanks for thanks for 99
## to the to the 89
## i love i love 85
## have a have a 80
## going to going to 78
t13 <- nlist(vt1, nGram3); head(t13,10)
## n num
## thanks for the thanks for the 50
## i love you i love you 23
## going to be going to be 21
## i want to i want to 18
## thank you for thank you for 18
## cant wait to cant wait to 17
## i have a i have a 16
## a lot of a lot of 15
## for the follow for the follow 14
## to see you to see you 14
n11 <- nlist(vn1, nGram1); head(n11,10)
## n num
## the the 9620
## and and 4346
## for for 1844
## that that 1649
## with with 1271
## said said 1241
## was was 1093
## but but 778
## from from 772
## his his 758
n12 <- nlist(vn1, nGram2); head(n12,10)
## n num
## in the in the 884
## of the of the 877
## to the to the 383
## on the on the 364
## for the for the 353
## at the at the 269
## and the and the 259
## in a in a 255
## to be to be 244
## with the with the 192
n13 <- nlist(vn1, nGram3); head(n13,10)
## n num
## one of the one of the 67
## a lot of a lot of 52
## going to be going to be 31
## according to the according to the 30
## as well as as well as 30
## be able to be able to 28
## part of the part of the 27
## the end of the end of 26
## to be a to be a 25
## it was a it was a 23
b11 <- nlist(vb1, nGram1); head(b11,10)
## n num
## the the 10207
## and and 5997
## that that 2568
## for for 1941
## you you 1623
## with with 1620
## was was 1510
## this this 1414
## have have 1195
## but but 1147
b12 <- nlist(vb1, nGram2); head(b12,10)
## n num
## of the of the 1000
## in the in the 846
## to the to the 501
## to be to be 385
## on the on the 383
## for the for the 332
## and the and the 325
## and i and i 286
## it was it was 265
## with the with the 265
b13 <- nlist(vb1, nGram3); head(b13,10)
## n num
## a lot of a lot of 82
## one of the one of the 76
## some of the some of the 39
## to be a to be a 39
## i want to i want to 36
## it was a it was a 36
## i have to i have to 33
## the end of the end of 32
## a couple of a couple of 31
## out of the out of the 31
Now we produce new filtered versions of the subsetted corpora. In this case we remove stopwords alongside the other filters that were used for the first filtering exercise:
## Filter sample corpora by removing whitespace, numbers, punctuation, stopwords and profanity
## Convert sample corpora to lowercase
vt2 <- VCorpus(VectorSource(t1))
vt2 <- tm_map(vt2, stripWhitespace)
vt2 <- tm_map(vt2, content_transformer(tolower))
vt2 <- tm_map(vt2, removeWords, bad)
vt2 <- tm_map(vt2, removeWords, stopwords("english"))
vt2 <- tm_map(vt2, removePunctuation)
vt2 <- tm_map(vt2, removeNumbers)
vn2 <- VCorpus(VectorSource(n1))
vn2 <- tm_map(vn2, stripWhitespace)
vn2 <- tm_map(vn2, content_transformer(tolower))
vn2 <- tm_map(vn2, removeWords, bad)
vn2 <- tm_map(vn2, removeWords, stopwords("english"))
vn2 <- tm_map(vn2, removePunctuation)
vn2 <- tm_map(vn2, removeNumbers)
vb2 <- VCorpus(VectorSource(b1))
vb2 <- tm_map(vb2, stripWhitespace)
vb2 <- tm_map(vb2, content_transformer(tolower))
vb2 <- tm_map(vb2, removeWords, bad)
vb2 <- tm_map(vb2, removeWords, stopwords("english"))
vb2 <- tm_map(vb2, removePunctuation)
vb2 <- tm_map(vb2, removeNumbers)
We apply our functions to the second batch of filtered corpora. We also subset the resulting frequency tables to the ten most frequent n-grams in each, to facilitate plotting:
## Produce n-gram frequency tables for the second batch of filtered corpora and
## subset each table to its ten most frequent n-grams
t21 <- nlist(vt2, nGram1)
t22 <- nlist(vt2, nGram2)
t23 <- nlist(vt2, nGram3)
n21 <- nlist(vn2, nGram1)
n22 <- nlist(vn2, nGram2)
n23 <- nlist(vn2, nGram3)
b21 <- nlist(vb2, nGram1)
b22 <- nlist(vb2, nGram2)
b23 <- nlist(vb2, nGram3)
t21a <- t21[1:10,]; t22a <- t22[1:10,]; t23a <- t23[1:10,]
n21a <- n21[1:10,]; n22a <- n22[1:10,]; n23a <- n23[1:10,]
b21a <- b21[1:10,]; b22a <- b22[1:10,]; b23a <- b23[1:10,]
Now we plot the ten most frequent n-grams in each of the filtered corpora, using the ggplot2 and gridExtra packages.
First we show the 1-, 2- and 3-grams for the Twitter corpus:
suppressMessages(library(ggplot2)); suppressMessages(library(gridExtra))
pt1 <- ggplot(t21a, aes(num,n)) + geom_col()
pt2 <- ggplot(t22a, aes(num,n)) + geom_col()
pt3 <- ggplot(t23a, aes(num,n)) + geom_col()
grid.arrange(pt1,pt2,pt3,nrow=1)
Next we show the 1-, 2- and 3-grams for the news corpus:
pn1 <- ggplot(n21a, aes(num,n)) + geom_col()
pn2 <- ggplot(n22a, aes(num,n)) + geom_col()
pn3 <- ggplot(n23a, aes(num,n)) + geom_col()
grid.arrange(pn1,pn2,pn3,nrow=1)
Finally we show the 1-, 2- and 3-grams for the blogs corpus:
pb1 <- ggplot(b21a, aes(num,n)) + geom_col()
pb2 <- ggplot(b22a, aes(num,n)) + geom_col()
pb3 <- ggplot(b23a, aes(num,n)) + geom_col()
grid.arrange(pb1,pb2,pb3,nrow=1)
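One possible refinement, not applied above, would be to order the bars by frequency and label the axes; for example, with reorder():
## Hypothetical refinement: bars ordered by frequency, with axis labels
pt1_ordered <- ggplot(t21a, aes(x = num, y = reorder(n, num))) +
    geom_col() +
    labs(x = "Frequency", y = "1-gram", title = "Top ten 1-grams, Twitter sample")
pt1_ordered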
There are clear distinctions in the n-grams obtained from the Twitter, news and blogs corpora.
The Twitter corpus is typically less formal and more conversational, with shorter and contracted words used more often than in the other corpora. Emotive n-grams such as "love" also feature more prominently.
The news corpus shows a reasonably strong emphasis on places, prominent people and organisations, and what people have been recorded as saying.
The blogs corpus appears to take on the role of personal narrative; its n-grams are more formal and slightly more sophisticated than those of the Twitter corpus, but still more personal than those of the news corpus.
These distinctions become clearer by removing stopwords, although a careful enough study of the n-grams including stopwords can still give similar impressions.
Some oddities show up in the n-grams (e.g. repetition of the single character "u" in the news n-grams), indicating that further work on tokenisation, filtering and preparing the corpora for n-gram construction is warranted.
I am considering development of a prediction algorithm using the markovchain package in R, as there are indications that this is relatively simple and light on computational power compared to some other methods.
Stupid back-off seems a realistic method for handling prediction when a phrase has not been encountered in the corpus used to build the model. If this proves too difficult or not particularly effective, some form of probabilistic substitution could be attempted instead.
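To illustrate the back-off idea using the frequency tables already built, a naive next-word lookup might look like the following sketch (illustrative only: predict_next is a hypothetical helper, no back-off weighting is applied, and the Twitter tables t13, t12 and t11 are assumed to be available):
## Naive back-off sketch over the n-gram frequency tables (hypothetical helper)
## Tries trigrams first, then bigrams, then falls back to the top unigram
predict_next <- function(phrase, tri = t13, bi = t12, uni = t11) {
    words <- tolower(unlist(strsplit(phrase, "\\s+")))
    k <- length(words)
    if (k >= 2) {
        ## look for trigrams beginning with the last two words of the phrase
        hits <- tri[grepl(paste0("^", words[k-1], " ", words[k], " "), tri$n), ]
        if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$n[1])))
    }
    if (k >= 1) {
        ## back off to bigrams beginning with the last word
        hits <- bi[grepl(paste0("^", words[k], " "), bi$n), ]
        if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$n[1])))
    }
    ## final fallback: the single most frequent unigram
    as.character(uni$n[1])
}
## e.g. predict_next("thanks for") should return "the" for the Twitter sample above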
There are indications that the quanteda package may work more efficiently than the tm package for NLP tasks, so I may look into this when developing the prediction model.
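For reference, an equivalent bigram count with quanteda might look roughly like the following (an untested sketch against the Twitter sample t1 and the profanity list bad):
## Rough quanteda equivalent of the bigram counts above (untested sketch)
suppressMessages(library(quanteda))
toks <- tokens(t1, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, bad)
bigrams <- tokens_ngrams(toks, n = 2, concatenator = " ")
topfeatures(dfm(bigrams), 10)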