This is a milestone report for the Coursera Data Science Capstone Project “Natural Language Processing”. The goal of this project is to build an app that is able to predict the next word, similar to SwiftKey word prediction. This intermediate report presents the results of the exploratory data analysis and my goals for the app and algorithm.
The app will use an n-gram model to predict the next word based on the previous 1, 2 or 3 words. A probability is assigned to each candidate word and the word with the highest probability is selected.
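For example (with made-up numbers), if the prediction table for the previous word “new” contained the candidates below, the app would return the one with the highest probability:
# Toy example with invented probabilities for candidates following the word "new"
candidates <- data.frame(word = c("york", "year", "jersey"),
                         prob = c(0.31, 0.22, 0.09))
candidates$word[which.max(candidates$prob)]
## [1] "york"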
In the first stage of the project the data is loaded and an Exploratory Data Analysis is performed. The provided data comes from three sources: blogs, Twitter and news.
There are various packages for natural language processing; based on internet research I chose quanteda, a comprehensive, fast and customizable R package for text analysis and management.
library(tidyverse)
library(quanteda) # R package for managing and analyzing text
library(tokenizers) # used to tokenize words
library(quanteda.textplots) # word cloud plot
source("importDatafiles.R") # function that loads data
The first step is to load the data:
# 1. first download files from internet
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
filename <- "./data/Coursera-SwiftKey.zip"
datadir <- "./data"
if(!file.exists(datadir)){dir.create(datadir)}
download.file(fileUrl, destfile = filename, method="curl") # curl: for https
unzip(filename, exdir = datadir, overwrite=TRUE)
# remove the zip file
file.remove(filename)
## [1] TRUE
# EN read data from blogs, news and twitter
ENdatadir <- "./data/en_US/"
EnBlogFile <- "en_US.blogs.txt"
EnNewsFile <- "en_US.news.txt"
EnTwitterFile <- "en_US.twitter.txt"
blogdata <- read_lines(paste(ENdatadir,EnBlogFile, sep="")) # txt file with no header
newsdata <- read_lines(paste(ENdatadir,EnNewsFile, sep="")) # txt file with no header
twitdata <- read_lines(paste(ENdatadir,EnTwitterFile, sep="")) # txt file with no header
Below is a basic summary of the three data files, including line, sentence and word counts:
## File FileSize Rows Sentences Words
## 1 BlogData 200 Mb 899288 2375718 37546250
## 2 NewsData 196 Mb 1010242 2024588 34762395
## 3 TwitterData 159 Mb 2360148 3770155 30093372
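The chunk that produced this summary is not included in the report; a sketch of one way to compute it, using count_sentences() and count_words() from the tokenizers package, is:
# Possible way to build the summary table above (the report's actual chunk is not shown)
summarize_file <- function(data, path, name) {
    data.frame(File = name,
               FileSize = paste(round(file.size(path) / 2^20), "Mb"),
               Rows = length(data),
               Sentences = sum(count_sentences(data)),
               Words = sum(count_words(data)))
}
rbind(summarize_file(blogdata, paste0(ENdatadir, EnBlogFile), "BlogData"),
      summarize_file(newsdata, paste0(ENdatadir, EnNewsFile), "NewsData"),
      summarize_file(twitdata, paste0(ENdatadir, EnTwitterFile), "TwitterData"))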
A function importDatafiles() is created that loads the blog, Twitter and news data and returns a data subset. By default only 2% of the content of the data files is loaded, using random sampling.
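The code in importDatafiles.R is not shown here; a minimal sketch of such a function, under the assumption that it samples each English file line by line, could be:
# Hypothetical sketch of importDatafiles(): read each English data file and return
# a random sample of the requested fraction of lines as one character vector.
# The actual function in importDatafiles.R may differ in detail.
importDatafiles <- function(fraction = 0.02, datadir = "./data/en_US/") {
    files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
    sampled <- lapply(files, function(f) {
        lines <- read_lines(file.path(datadir, f))
        sample(lines, size = floor(fraction * length(lines)))
    })
    unlist(sampled)
}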
set.seed(123)
datasubset <- importDatafiles(0.02)
Using this data subset and the functions of the quanteda package, I have cleaned the data and built a corpus, tokens and a document-feature matrix (dfm). The most frequently used words and word pairs are determined and plotted in a word cloud:
# build a corpus and clean it
my_corpus <- corpus(datasubset)
my_tokens <- tokens(my_corpus, remove_punct = TRUE, remove_numbers = TRUE,
                    remove_symbols = TRUE) %>%
    tokens_tolower() %>%                  # lowercase before stopword removal and stemming
    tokens_select(stopwords("en", source = "stopwords-iso"),
                  selection = "remove") %>%
    tokens_wordstem()
# create N-gram tokens for word pairs
my_tokens_2grams <- tokens_ngrams(my_tokens, n= 2)
# Now create dfm: document feature matrix for words and wordpairs
my_dfm <- dfm(my_tokens)
my_dfm2 <- dfm(my_tokens_2grams)
# topfeatures words
wfreq <- data.frame(topfeatures(my_dfm, 15))
colnames(wfreq)[1] <- "count"
wfreq$word <- row.names(wfreq)
# topfeatures word pairs
twofreq <- data.frame(topfeatures(my_dfm2, 15))
colnames(twofreq)[1] <- "count"
twofreq$word <- row.names(twofreq)
# Plot the most frequently used words and word pairs
ggplot(wfreq, aes(x=reorder(word, -count), y = count)) +
geom_bar(stat="identity", fill="lightblue") + ggtitle("Most popular words") +
xlab("") + coord_flip()
ggplot(twofreq, aes(x=reorder(word, -count), y = count)) + ggtitle("Most popular wordpairs") +
geom_bar(stat="identity", fill="lightblue") + xlab("") + coord_flip()
# use quanteda.textplots to create a word cloud
textplot_wordcloud(my_dfm, max_words = 50)
textplot_wordcloud(my_dfm2, max_words = 50)
The prediction algorithm will be an n-gram model (a Markov chain model) that predicts the next word based on the previous 1, 2 or 3 words. At most a 4-gram model (unigrams, bigrams, 3-grams and 4-grams) is required. Some form of smoothing is also required, so that all n-grams get a non-zero probability even if they do not appear in the observed data. A backoff model is suggested to estimate the probability of unobserved n-grams. As the model is developed, its size and runtime have to be minimized in order to provide a reasonable user experience.
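To make the backoff idea concrete, here is a rough sketch (with a hypothetical predict_next_word() and prediction tables that still have to be built), not the final implementation:
# Hypothetical sketch of a backoff lookup. `tables` is assumed to be a named list
# of prediction tables keyed by prefix length ("1", "2", "3"), each a data frame
# with columns prefix, word and prob; the prefix words are joined by "_".
predict_next_word <- function(input, tables) {
    words <- tolower(unlist(strsplit(input, "\\s+")))
    if (length(words) == 0) return("the")
    for (n in min(3, length(words)):1) {          # back off from the longest known prefix
        prefix <- paste(tail(words, n), collapse = "_")
        hits <- tables[[as.character(n)]]
        hits <- hits[hits$prefix == prefix, ]
        if (nrow(hits) > 0) return(hits$word[which.max(hits$prob)])
    }
    "the"  # fall back to a frequent unigram when no prefix matches
}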
The plan is:
Corpus creation
Cleaning the corpus and splitting it into a training (80%) and a test (20%) corpus
N-gram generation
Generation of prediction tables for each n-gram from the training corpus, containing the probability of each word; the tables will be based on Maximum Likelihood Estimation (MLE), see the sketch after this list
Development of the model
Model evaluation
Creation of a Shiny app where the user can input 1, 2 or 3 words and the next predicted word is displayed.
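The split and the MLE tables can be sketched already. Assuming the sampled lines from the exploratory analysis above (the final code will also build 3-gram and 4-gram tables), an 80/20 split and a bigram MLE table could look like this:
# Sketch, not final code: 80/20 train/test split of the sampled lines and an
# MLE prediction table for bigrams (3- and 4-grams would be handled the same way).
# dplyr and tidyr are attached via library(tidyverse) above.
set.seed(123)
train_idx  <- sample(seq_along(datasubset), size = floor(0.8 * length(datasubset)))
train_data <- datasubset[train_idx]
test_data  <- datasubset[-train_idx]
train_tokens <- tokens(corpus(train_data), remove_punct = TRUE) %>% tokens_tolower()
bigram_dfm   <- dfm(tokens_ngrams(train_tokens, n = 2))
bigram_table <- data.frame(ngram = featnames(bigram_dfm),
                           count = colSums(bigram_dfm)) %>%
    separate(ngram, into = c("prefix", "word"), sep = "_") %>%
    group_by(prefix) %>%
    mutate(prob = count / sum(count)) %>%   # MLE: P(word | prefix)
    ungroup() %>%
    arrange(prefix, desc(prob))
In the Shiny app, the prediction then reduces to filtering such a table on the entered prefix and returning the word with the highest probability.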