This is a milestone report for the Coursera Data Science Capstone Project “Natural Language Processing”. The goal of this project is to build an app that is able to predict the next word, similar to SwiftKey word prediction. This intermediate report presents the results of the exploratory data analysis and my goals for the app and algorithm.
The app will use an n-gram model to predict the next word based on the previous 1, 2 or 3 words. A probability is assigned to each candidate word and the word with the highest probability is selected.
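For example (with made-up numbers), if the prediction table for the previous word “new” contained the candidates below, the app would return the one with the highest probability:
# Toy example with invented probabilities for candidates following the word "new"
candidates <- data.frame(word = c("york", "year", "jersey"),
                         prob = c(0.31, 0.22, 0.09))
candidates$word[which.max(candidates$prob)]
## [1] "york"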
In the first stage of the project the data is loaded and an Exploratory Data Analysis is performed. The provided data comes from three sources: blogs, Twitter and news.
There are various packages for natural language processing; based on internet research I chose quanteda, a comprehensive, fast and customizable R package for text analysis and management.
library(tidyverse)
library(quanteda) # R package for managing and analyzing text
library(tokenizers) # used to tokenize words
library(quanteda.textplots) # word cloud plot
source("importDatafiles.R") # function that loads data
The first step is to load the data:
# 1. first download files from internet
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
filename <- "./data/Coursera-SwiftKey.zip"
datadir <- "./data"
if(!file.exists(datadir)){dir.create(datadir)}
download.file(fileUrl, destfile = filename, method="curl") # curl: for https
unzip(filename, exdir = datadir, overwrite=TRUE)
# remove the zip file
file.remove(filename)
## [1] TRUE
# EN read data from blogs, news and twitter
ENdatadir <- "./data/en_US/"
EnBlogFile <- "en_US.blogs.txt"
EnNewsFile <- "en_US.news.txt"
EnTwitterFile <- "en_US.twitter.txt"
blogdata <- read_lines(paste(ENdatadir,EnBlogFile, sep="")) # txt file with no header
newsdata <- read_lines(paste(ENdatadir,EnNewsFile, sep="")) # txt file with no header
twitdata <- read_lines(paste(ENdatadir,EnTwitterFile, sep="")) # txt file with no header
Below is a basic summary of the three data files, including line, sentence and word counts:
## File FileSize Rows Sentences Words
## 1 BlogData 200 Mb 899288 2375718 37546250
## 2 NewsData 196 Mb 1010242 2024588 34762395
## 3 TwitterData 159 Mb 2360148 3770155 30093372
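The chunk that produced this summary is not included in the report; a sketch of one way to compute it, using count_sentences() and count_words() from the tokenizers package, is:
# Possible way to build the summary table above (the report's actual chunk is not shown)
summarize_file <- function(data, path, name) {
    data.frame(File = name,
               FileSize = paste(round(file.size(path) / 2^20), "Mb"),
               Rows = length(data),
               Sentences = sum(count_sentences(data)),
               Words = sum(count_words(data)))
}
rbind(summarize_file(blogdata, paste0(ENdatadir, EnBlogFile), "BlogData"),
      summarize_file(newsdata, paste0(ENdatadir, EnNewsFile), "NewsData"),
      summarize_file(twitdata, paste0(ENdatadir, EnTwitterFile), "TwitterData"))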
A function importDatafiles() is created that loads the blog, Twitter and news data and returns a data subset. By default only 2% of the content of the data files is loaded, using random sampling.
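The code in importDatafiles.R is not shown here; a minimal sketch of such a function, under the assumption that it samples each English file line by line, could be:
# Hypothetical sketch of importDatafiles(): read each English data file and return
# a random sample of the requested fraction of lines as one character vector.
# The actual function in importDatafiles.R may differ in detail.
importDatafiles <- function(fraction = 0.02, datadir = "./data/en_US/") {
    files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
    sampled <- lapply(files, function(f) {
        lines <- read_lines(file.path(datadir, f))
        sample(lines, size = floor(fraction * length(lines)))
    })
    unlist(sampled)
}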
set.seed(123)
datasubset <- importDatafiles(0.02)
Using this data subset and the functions of the quanteda package, I have cleaned the data and built a corpus, tokens and a document-feature matrix (dfm). The most frequently used words and word pairs are determined and plotted in a word cloud:
# build a corpus and clean it
my_corpus <- corpus(datasubset)
my_tokens <- tokens(my_corpus, remove_punct = TRUE, remove_numbers = TRUE,
                    remove_symbols = TRUE) %>%
    tokens_tolower() %>%                  # lowercase before stopword removal and stemming
    tokens_select(stopwords("en", source = "stopwords-iso"),
                  selection = "remove") %>%
    tokens_wordstem()
# create N-gram tokens for word pairs
my_tokens_2grams <- tokens_ngrams(my_tokens, n= 2)
# Now create dfm: document feature matrix for words and wordpairs
my_dfm <- dfm(my_tokens)
my_dfm2 <- dfm(my_tokens_2grams)
# topfeatures words
wfreq <- data.frame(topfeatures(my_dfm, 15))
colnames(wfreq)[1] <- "count"
wfreq$word <- row.names(wfreq)
# topfeatures word pairs
twofreq <- data.frame(topfeatures(my_dfm2, 15))
colnames(twofreq)[1] <- "count"
twofreq$word <- row.names(twofreq)
# Plot the most frequently used words and word pairs
ggplot(wfreq, aes(x=reorder(word, -count), y = count)) +
geom_bar(stat="identity", fill="lightblue") + ggtitle("Most popular words") +
xlab("") + coord_flip()
ggplot(twofreq, aes(x=reorder(word, -count), y = count)) + ggtitle("Most popular wordpairs") +
geom_bar(stat="identity", fill="lightblue") + xlab("") + coord_flip()
# use quanteda.textplots to create a word cloud
textplot_wordcloud(my_dfm, max_words = 50)
textplot_wordcloud(my_dfm2, max_words = 50)
The prediction algorithm will be an n-gram model (a Markov chain model) that predicts the next word based on the previous 1, 2 or 3 words. At most a 4-gram model (unigrams, bigrams, 3-grams and 4-grams) is required. Some form of smoothing is also required, so that all n-grams get a non-zero probability even if they do not appear in the observed data. A backoff model is suggested to estimate the probability of unobserved n-grams. As the model is developed, its size and runtime have to be minimized in order to provide a reasonable user experience.
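To make the backoff idea concrete, here is a rough sketch (with a hypothetical predict_next_word() and prediction tables that still have to be built), not the final implementation:
# Hypothetical sketch of a backoff lookup. `tables` is assumed to be a named list
# of prediction tables keyed by prefix length ("1", "2", "3"), each a data frame
# with columns prefix, word and prob; the prefix words are joined by "_".
predict_next_word <- function(input, tables) {
    words <- tolower(unlist(strsplit(input, "\\s+")))
    if (length(words) == 0) return("the")
    for (n in min(3, length(words)):1) {          # back off from the longest known prefix
        prefix <- paste(tail(words, n), collapse = "_")
        hits <- tables[[as.character(n)]]
        hits <- hits[hits$prefix == prefix, ]
        if (nrow(hits) > 0) return(hits$word[which.max(hits$prob)])
    }
    "the"  # fall back to a frequent unigram when no prefix matches
}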
The plan is:
Corpus creation
Cleaning the corpus and splitting it into a training (80%) and a test (20%) corpus
N-gram generation
Generation of prediction tables for each n-gram from the training corpus, containing the probability of each word; the tables will be based on Maximum Likelihood Estimation (MLE), see the sketch after this list
Development of the model
Model evaluation
Creation of a Shiny app where the user can input 1, 2 or 3 words and the next predicted word is displayed.
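The split and the MLE tables can be sketched already. Assuming the sampled lines from the exploratory analysis above (the final code will also build 3-gram and 4-gram tables), an 80/20 split and a bigram MLE table could look like this:
# Sketch, not final code: 80/20 train/test split of the sampled lines and an
# MLE prediction table for bigrams (3- and 4-grams would be handled the same way).
# dplyr and tidyr are attached via library(tidyverse) above.
set.seed(123)
train_idx  <- sample(seq_along(datasubset), size = floor(0.8 * length(datasubset)))
train_data <- datasubset[train_idx]
test_data  <- datasubset[-train_idx]
train_tokens <- tokens(corpus(train_data), remove_punct = TRUE) %>% tokens_tolower()
bigram_dfm   <- dfm(tokens_ngrams(train_tokens, n = 2))
bigram_table <- data.frame(ngram = featnames(bigram_dfm),
                           count = colSums(bigram_dfm)) %>%
    separate(ngram, into = c("prefix", "word"), sep = "_") %>%
    group_by(prefix) %>%
    mutate(prob = count / sum(count)) %>%   # MLE: P(word | prefix)
    ungroup() %>%
    arrange(prefix, desc(prob))
In the Shiny app, the prediction then reduces to filtering such a table on the entered prefix and returning the word with the highest probability.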