Data Science Capstone

Week 2: Exploration of text data

Ronan Mulligan

2020-08-20

Introduction

The goal of the capstone project is to develop a text prediction algorithm. The algorithm needs to have some basic properties:

  - it must be fast
  - it must be efficient (RStudio's hosted servers are limited to 1 GB on the free plan)
  - it must be accurate (-ish; speed is to be prioritised over accuracy, according to the Coursera forums!)

The overall strategy for my app will follow the advice provided by Len Greski in his Simplify, Simplify, Simplify post on the discussion forum, i.e.:

"A simple solution to the Capstone can be accomplished with three key tools:

data.table – due to its high performance, low memory usage, and ability to do an indexed search like a database table, this package is extremely useful not only to create the data needed for the prediction algorithm, but it is also very valuable in the shiny app.

quanteda::tokens_ngrams() – the workhorse that will generate the data needed for the easiest possible algorithm, a simple back off model based on last word frequencies / probabilities given a set of first words

SQL with the sqldf package – given a set of n-grams that are aggregated into three columns, a base consisting of n-1 words in the n-gram, and a prediction that is the last word, and a count variable for the frequency of occurrence of this n-gram, it’s easy to write an SQL statement to extract the most frequently occurring prediction and save these into an output data.table for your shiny app"
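
As a rough illustration of that last step (a sketch only: the table ngrams and the columns base, prediction and count come from the quote above, and the sample rows are invented), the SQL look-up might look like this:

    library(data.table)
    library(sqldf)

    # Toy n-gram table in the shape described in the quote:
    # 'base' = the first n-1 words, 'prediction' = the last word,
    # 'count' = how often the full n-gram occurred
    ngrams <- data.table(
      base       = c("thanks for", "thanks for", "at the"),
      prediction = c("the", "your", "end"),
      count      = c(250L, 180L, 90L)
    )

    # For each base, keep the most frequently occurring prediction
    # (SQLite returns the row holding the MAX when bare columns are
    # mixed with an aggregate)
    top_predictions <- sqldf(
      "SELECT base, prediction, MAX(count) AS count
         FROM ngrams
        GROUP BY base"
    )
    top_predictions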

The quanteda package

The quanteda package will be key to the analysis, and I have referred to its quick start guide and cheat sheet quite a bit in the solution.

High-level approach: creating n-grams, a Katz back-off algorithm, and a look-up table

The proposed approach will be to:

  1. convert the text files into corpora
  2. tokenise the corpora and produce n-grams (1-, 2- and 3-grams should be sufficient)
  3. save the n-grams as data.table(s) (see the code sketch after this list)
  4. use the Katz back-off algorithm with Good-Turing discounting to estimate the probabilities of the next word given some input words (whether one, two or three words)
  5. look up the most probable words and output them
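
A minimal sketch of steps 1 to 3, assuming a small character vector txt stands in for the real text files (all object names here are illustrative, not the final code):

    library(quanteda)
    library(data.table)

    # Step 1: build a corpus from raw text lines
    txt  <- c("this is a tiny example", "another tiny example sentence")
    corp <- corpus(txt)

    # Step 2: tokenise and generate 1-, 2- and 3-grams
    toks   <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
    ngrams <- tokens_ngrams(toks, n = 1:3, concatenator = " ")

    # Step 3: count n-gram frequencies and store them in a data.table
    dfmat <- dfm(ngrams)
    freq  <- data.table(ngram = featnames(dfmat),
                        count = as.integer(colSums(dfmat)))
    setorder(freq, -count)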

Week 2 Exploratory Data Analysis

The data files

We will work on the English language files, as English is the language most familiar to me. There are three English language files, obtained by web scraping a news website, Twitter and blogs. The files are quite large. Sentences are incomplete and non-sequential to preserve anonymity.

Loading and Exploring the data

First we load the data and then gather some basic information about it. Here the base R readLines function and the stri_stats_general function from the stringi package are useful.
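
A sketch of the loading step is below (the file paths are placeholders; the object name path_news matches the one visible in the warning that follows):

    library(stringi)

    path_blogs   <- "data/en_US.blogs.txt"
    path_news    <- "data/en_US.news.txt"
    path_twitter <- "data/en_US.twitter.txt"

    # Read the raw files; skipNul avoids failures on embedded nul characters
    blogs   <- readLines(path_blogs,   encoding = "UTF-8", skipNul = TRUE)
    news    <- readLines(path_news,    encoding = "UTF-8", skipNul = TRUE)
    twitter <- readLines(path_twitter, encoding = "UTF-8", skipNul = TRUE)

    # Lines, non-empty lines, characters and non-whitespace characters
    stri_stats_general(blogs)

    # Approximate word count and file size in MB
    sum(stri_count_words(blogs))
    round(file.size(path_blogs) / 1024^2, 1)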

## Warning in readLines(path_news, encoding = "UTF-8", skipNul = TRUE): incomplete
## final line found on 'C:/Users/rmulligan001/Documents/My Training/R/Data Science
## Specialization/Capstone/NLP/data/en_US.news.txt'

Summary table of information about the text files.

FileName        Size (MB)   Lines       Non-empty lines   Chars          Chars (non-white)   Word count
en_US.blogs     200.4       899,288     899,288           206,824,382    170,389,539         37,570,839
en_US.news      196.3       77,259      77,259            15,639,408     13,072,698          2,651,432
en_US.twitter   159.4       2,360,148   2,360,148         162,096,241    134,082,806         30,451,170

The table shows that the file sizes are large, and this may be a factor in later analysis, as tokenisation is a memory-intensive process.

Converting to a corpus and further exploratory analysis

We convert the raw text files to a corpus so that we can more easily analyse the data using quanteda.

We begin with the tokenisation of the corpus. Tokenisation (Wikipedia provides a simple overview of lexical analysis here: https://en.wikipedia.org/wiki/Lexical_analysis) converts the text in the corpus into useful units (in this case words) and allows for easier future analysis, including statistical analysis. The quanteda default tokeniser is used. There are a number of data cleaning steps performed as part of tokenisation (a code sketch follows the list below), including:

  1. removing numbers
  2. removing punctuation
  3. removing separators
  4. removing URLs
  5. converting all text to lower case (makes later analysis easier if we don’t have to deal with capitalisation)
  6. removing profanity: this was achieved with the helpful (and inventive!) List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words repository of "profane" words found on GitHub (a Google search for that name will find it)
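
A sketch of these cleaning steps, assuming the corpus object corp from earlier and a plain-text word list downloaded from the repository mentioned above (the file name badwords.txt is a placeholder):

    library(quanteda)

    # Steps 1-4: quanteda's default tokeniser with numbers, punctuation,
    # separators and URLs removed
    toks <- tokens(
      corp,
      remove_numbers    = TRUE,
      remove_punct      = TRUE,
      remove_separators = TRUE,
      remove_url        = TRUE
    )

    # Step 5: convert everything to lower case
    toks <- tokens_tolower(toks)

    # Step 6: remove profanity using the downloaded word list
    profanity <- readLines("badwords.txt", encoding = "UTF-8")
    toks      <- tokens_remove(toks, pattern = profanity)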

The quanteda package contains the so-called "Swiss Army knife" function dfm(), which builds a document-feature matrix. This is used to identify the key features of the data.
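
For example (a sketch continuing from the toks object above):

    # Build the document-feature matrix and list the most frequent features
    dfmat <- dfm(toks)
    topfeatures(dfmat, 20)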

The most frequently occurring words are those we might expect, and profanity has been removed. We did not remove stopwords: they occur frequently, and since we are trying to build a prediction algorithm, we should expect frequently occurring words to be important!

Similarly, the bigrams are as we might expect. "The" will feature prominently in our algorithm!

Ditto for the trigrams.
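
The bigram and trigram frequencies can be inspected in the same way (a sketch reusing toks; not necessarily the exact code behind the counts discussed above):

    # Most frequent bigrams
    topfeatures(dfm(tokens_ngrams(toks, n = 2, concatenator = " ")), 20)

    # Most frequent trigrams
    topfeatures(dfm(tokens_ngrams(toks, n = 3, concatenator = " ")), 20)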