This document describes exploratory data analysis undertaken on a set of text corpora, in preparation for building a natural language prediction algorithm. It forms the milestone report for week 2 of the Data Science Capstone. In this report, I describe the downloading and processing of a large set of textual data and report some of the trends found in it. I also briefly summarise my plans for building a prediction model, based on the findings of this analysis.
For the sake of reproducibility, I have included all of my R code; readers not interested in the implementation can safely skip the code chunks.
I download and unzip the data from the link given in the course notes.
fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip" # as provided by the course notes
destfile <- "./data/dataset.zip"
# download and unzip only if the data is not already present
if (!file.exists(destfile)) {
  if (!file.exists(dirname(destfile))) {
    dir.create(dirname(destfile))
  }
  download.file(fileURL, destfile = destfile)
  unzip(destfile, exdir = dirname(destfile))
}
Now that the data has been downloaded and unzipped, I read in the three English-language data sets, which are sourced from blogs, news, and Twitter respectively.
if (!exists("dblogs")) {
dblogs <- readLines("./data/final/en_US/en_US.blogs.txt")
}
if (!exists("dnews")) {
dnews <- readLines("./data/final/en_US/en_US.news.txt")
}
if (!exists("dtwitter")) {
dtwitter <- readLines("./data/final/en_US/en_US.twitter.txt")
}
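(On some platforms readLines warns about embedded nul characters or an incomplete final line for the news file; if that happens, one possible workaround, sketched below, is to re-read that file with skipNul = TRUE.)
# optional: only needed if the readLines call above warns about embedded nuls
dnews <- readLines("./data/final/en_US/en_US.news.txt", skipNul = TRUE)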
Using the ‘ngram’ package, I summarise the word and line counts for each data set.
library(ngram)
# get a word and line count for each data source
blogwords <- format(sum(unlist(lapply(dblogs, wordcount, sep = " ", count.function = sum))), big.mark = ",")
bloglines <- format(length(dblogs), big.mark = ",")
newswords <- format(sum(unlist(lapply(dnews, wordcount, sep = " ", count.function = sum))), big.mark = ",")
newslines <- format(length(dnews), big.mark = ",")
twitwords <- format(sum(unlist(lapply(dtwitter, wordcount, sep = " ", count.function = sum))), big.mark = ",")
twitlines <- format(length(dtwitter), big.mark = ",")
# summarise the data in a table and print
data.source <- c("Blogs", "News", "Twitter")
number.words <- c(blogwords, newswords, twitwords)
number.lines <- c(bloglines, newslines, twitlines)
dsum <- data.frame(data.source, number.words, number.lines, stringsAsFactors = FALSE)
print(dsum)
## data.source number.words number.lines
## 1 Blogs 37,334,131 899,288
## 2 News 2,643,969 77,259
## 3 Twitter 30,373,543 2,360,148
Clearly, the datasets are very large. The “Blogs” dataset contains the largest number of words (37,334,131), while the “Twitter” dataset contains the largest number of lines (2,360,148).
Key to building a prediction model will be the use of n-grams, that is, chains of words that appear in text. A 1-gram is simply a single word, a 2-gram is a chain of two words, and a 3-gram a chain of three words. Understanding common patterns of n-grams will allow a predictive model to be built, i.e. one that predicts the next word in a sequence given a preceding chain of words.
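As a small illustration (a toy example, separate from the analysis that follows), the 2-grams of a short sentence can be listed with the quanteda function tokens_ngrams, which joins the words of each chain with an underscore:
library(quanteda)
# 2-grams of a toy sentence, joined with "_" as in the frequency tables later on
tokens_ngrams(tokens("to be or not to be"), n = 2)
# yields the chains "to_be", "be_or", "or_not", "not_to", "to_be"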
Here, I have combined the separate blogs, news and Twitter texts into a single object, created a ‘corpus’ object, and extracted all the 1-, 2-, and 3-word chains. Note that I have excluded all punctuation from the analysis.
# combine the data sources into a single corpus
alldata <- c(dblogs, dnews, dtwitter)
library(quanteda)
corp <- corpus(alldata)
# extract all 1-grams and calculate the frequency of occurrence
dfm_1gram <- dfm(corp, ngrams = 1, remove_punct = TRUE)
freq_1gram <- sort(colSums(dfm_1gram), decreasing = TRUE)
# extract all 2-grams and calculate the frequency of occurrence
dfm_2gram <- dfm(corp, ngrams = 2, remove_punct = TRUE)
freq_2gram <- sort(colSums(dfm_2gram), decreasing = TRUE)
# extract all 3-grams and calculate the frequency of occurrence
dfm_3gram <- dfm(corp, ngrams = 3, remove_punct = TRUE)
freq_3gram <- sort(colSums(dfm_3gram), decreasing = TRUE)
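One reproducibility note on the chunk above: more recent quanteda releases no longer accept the ngrams and remove_punct arguments to dfm(), so those calls may error. An equivalent tokens-based sketch (shown for the 3-gram case only, assuming the same corp object) is:
# equivalent 3-gram extraction for newer quanteda versions (a sketch, not the original run)
toks <- tokens(corp, remove_punct = TRUE)      # tokenise, dropping punctuation
dfm_3gram <- dfm(tokens_ngrams(toks, n = 3))   # 3-grams joined with "_"
freq_3gram <- sort(colSums(dfm_3gram), decreasing = TRUE)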
With the n-grams created, I have produced below a graph of the ten most frequently occurring examples of each.
barplot(freq_1gram[1:10], horiz = TRUE, main = "Frequency of 1-grams in Corpora", xlab = "Frequency of occurrence", las = 1)
barplot(freq_2gram[1:10], horiz = TRUE, main = "Frequency of 2-grams in Corpora", xlab = "Frequency of occurrence", las = 1)
par(mar = c(5, 8, 4, 1) + 0.1)  # widen the left margin so the longer 3-gram labels fit
barplot(freq_3gram[1:10], horiz = TRUE, main = "Frequency of 3-grams in Corpora", xlab = "Frequency of occurrence", las = 1)
It’s clear that some chains of words are far more common than others. The most frequently occurring word (or 1-gram) in the dataset is “the”. The most frequently occurring chain of two words (or 2-gram) is “of_the”, and the most frequently occurring chain of three words (or 3-gram) is “thanks_for_the”.
From the analysis undertaken so far, it’s clear that certain words and word chains occur with very strong regularity. Given the frequency with which they occur, it should be possible to estimate the probable next word in a given chain: from a chain of five consecutive words, a sixth word could be predicted by examining 6-grams; similarly, predicting the fifth word following a chain of four would require an examination of 5-grams, and so on.
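As a rough sketch of how such a lookup might work against the 3-gram table built above (the helper name predict_next and its exact logic are illustrative, not the final algorithm):
# sketch: suggest the word most often seen after the two preceding words
predict_next <- function(w1, w2, freq3 = freq_3gram) {
  prefix <- paste0(tolower(w1), "_", tolower(w2), "_")
  matches <- freq3[startsWith(names(freq3), prefix)]  # 3-grams beginning with the two words
  if (length(matches) == 0) return(NA_character_)
  sub(".*_", "", names(matches)[1])                   # final word of the most frequent match
}
predict_next("thanks", "for")  # expected to suggest "the", given the counts above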
So a prediction algorithm, embedded within a Shiny app, could be built on top of a database of n-grams. The app would allow the user to input text and, using the predictive algorithm, suggest the next word (or words) to complete the text input.
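A skeleton of such an app might look like the following; this is a sketch only, and it assumes a predict_next()-style lookup like the one sketched above (all names here are placeholders):
library(shiny)
ui <- fluidPage(
  textInput("phrase", "Type a phrase:"),
  textOutput("suggestion")
)
server <- function(input, output) {
  output$suggestion <- renderText({
    words <- tolower(strsplit(trimws(input$phrase), "\\s+")[[1]])
    if (length(words) < 2) return("")
    # look up the most likely next word from the n-gram table
    nxt <- predict_next(words[length(words) - 1], words[length(words)])
    if (is.na(nxt)) "(no suggestion)" else nxt
  })
}
shinyApp(ui, server)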
One final point to note is that building n-grams from a corpus of this size takes a lot of computational time. As an example, building the 1-, 2-, and 3-grams described in this report took over an hour of run time. This is impractical for a model, so the n-gram tables should instead be built from a smaller body of text, most likely obtained by sampling from the full dataset.
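As a rough sketch of that sampling step (the 5% fraction and the seed are placeholders, not decisions made in this report):
set.seed(1234)       # placeholder seed, for reproducibility
sample_frac <- 0.05  # placeholder sampling fraction
sampledata <- sample(alldata, size = round(length(alldata) * sample_frac))
corp_sample <- corpus(sampledata)  # the n-gram steps above would then run on this instead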