Project Description
The Capstone Project is the final part of the Data Science Specialisation from Johns Hopkins University and requires the implementation of a text prediction algorithm using a base corpus of news articles, blogs and tweets provided by SwiftKey.
In this process, we have attempted to strike a balance between accuracy of the text prediction model and the efficiency of loading, processing and analysing the raw data.
Data Processing
Introduction
The raw data for the project is provided by SwiftKey and consists of raw text in English, German, Russian and Finnish. In this project we have only used the English language files to create an English language text predictor.
Data Importing
Data has been read in using the readr package in R, which offers faster processing than the traditional packages used for reading in text data. The read_lines function reads each line of text as a separate element of a character vector.
library(readr)
twitter_data <- read_lines("en_US/en_US.twitter.txt")
news_data <- read_lines("en_US/en_US.news.txt")
blogs_data <- read_lines("en_US/en_US.blogs.txt")

Next the data was processed into data.tables using the data.table package. This was preferred over data frames given the size of the data, as data.table modifies data by reference and can be up to 20 times faster in the way data is modified within the table.
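As an illustration of this step, a minimal sketch of the conversion is shown below; the "source" and "text" column names are illustrative choices rather than the project's actual code.

library(data.table)

# Minimal sketch: wrap each character vector in a data.table, tagging its source
# (the "source" and "text" column names are illustrative choices)
twitter_dt <- data.table(source = "twitter", text = twitter_data)
news_dt    <- data.table(source = "news",    text = news_data)
blogs_dt   <- data.table(source = "blogs",   text = blogs_data)

# data.table updates columns by reference, e.g. adding a character count in place
twitter_dt[, n_char := nchar(text)]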
Data Summary
The initial task was to assess the size of the data in each of the three text files provided. A summary of the data is presented below:
| Source | Size (MB) | Number of Lines |
|---|---|---|
| Twitter | 318.99 | 2,360,148 |
| News | 257.34 | 1,010,242 |
| Blogs | 255.35 | 899,288 |
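As an indication of how this summary can be produced, a sketch is shown below; it assumes the sizes reported are the in-memory object sizes in megabytes, which is one possible interpretation.

library(data.table)

# Sketch: summarise object size (MB) and line count for each source vector
sources <- list(Twitter = twitter_data, News = news_data, Blogs = blogs_data)

data.table(
  Source            = names(sources),
  `Size (MB)`       = sapply(sources, function(x) as.numeric(object.size(x)) / 1024^2),
  `Number of Lines` = sapply(sources, length)
)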
As we can see, the files are all large, especially when we consider that they are to be used as the input to an NLP routine, where object sizes can grow substantially when the text is converted into document-feature matrices.
Below are examples of the longest lines of text from each of the three sources and their character lengths:
| Source | Max Length | Text |
|---|---|---|
| twitter | 140 | Don’t c… |
| news | 11384 | (DEMOCRATS) PRESIDENT OF THE U.S. Barack Obama 46,938 UNITED STATES SENATOR Joe Donnelly 42,265 G… |
| blogs | 40833 | UPDATE AS OF 11:30 A.M. EDT, MONDAY, APRIL 11: No damage to Japan’s nuclear power plants was repo… |
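For reference, a sketch of how the maximum line length can be found for a single source:

# Sketch: longest line and its character length for one source
line_lengths <- nchar(twitter_data)
max(line_lengths)                          # maximum character length
twitter_data[which.max(line_lengths)]      # the longest line itself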
Exploratory Data Analysis
The initial analysis of the data will use all the English language data; however, this will be reduced in size when building the initial prediction model.
Natural Language Processing
Creating N-Grams
The processing of the text data has predominantly been carried out using data.table and the quanteda package in order to improve processing speed.
The next step is to process the raw data into N-Grams to see which are the most frequent. The process followed was to:
- Create a corpus for each of the three data sources (twitter, news and blogs)
- Tokenise the data, which splits each line into individual words
- Remove numbers, punctuation, separators, split hyphens and make lowercase
- Remove a list of profanity words from the text
- Create N-Grams of orders 1 to 3 and analyse the most frequent phrases.
Note that stopwords are not removed and lemmatisation is not applied, as the output will be used for text prediction on input from a user, which will naturally include words such as “the”, “and” and “or”.
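A minimal sketch of this pipeline for a single source is shown below; this is not the exact project code, and the profanity list is a placeholder.

library(quanteda)

# Placeholder profanity list; in the project a full word list is used
profanity <- c("badword1", "badword2")

# Build the corpus and tokenise, removing numbers, punctuation, separators
# and splitting hyphens, then lowercase and remove profanity
twitter_corpus <- corpus(twitter_data)
twitter_tokens <- tokens(twitter_corpus,
                         remove_numbers    = TRUE,
                         remove_punct      = TRUE,
                         remove_separators = TRUE,
                         split_hyphens     = TRUE)
twitter_tokens <- tokens_tolower(twitter_tokens)
twitter_tokens <- tokens_remove(twitter_tokens, pattern = profanity)

# N-Grams of orders 1 to 3; topfeatures() gives the most frequent phrases
for (n in 1:3) {
  ngram_dfm <- dfm(tokens_ngrams(twitter_tokens, n = n, concatenator = " "))
  print(topfeatures(ngram_dfm, 10))
}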
Analysing N-Gram Frequencies
The top 50 N-Grams for each source were saved into an RData file after processing, given the time taken to compute each N-Gram. These were then charted using the ggplot2 package.
load("top_50_grams.Rdata")
library(ggplot2)
library(dplyr)   # provides desc() used in reorder() below
ggplot( onegram_top_50_all_types, aes( fill = type, x = reorder(feature, desc(total_sum)), y = frequency)) +
geom_bar( position = "stack", stat = "identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs( x = "Most Frequent Onegrams", y = "Frequency", title = "Stacked Bar Chart of Most Frequent Onegrams")ggplot( bigram_top_50_all_types, aes( fill = type, x = reorder(feature, desc(total_sum)), y = frequency)) +
geom_bar( position = "stack", stat = "identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs( x = "Most Frequent Bigrams", y = "Frequency", title = "Stacked Bar Chart of Most Frequent Bigrams")ggplot( trigram_top_50_all_types, aes( fill = type, x = reorder(feature, desc(total_sum)), y = frequency)) +
geom_bar( position = "stack", stat = "identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs( x = "Most Frequent Trigrams", y = "Frequency", title = "Stacked Bar Chart of Most Frequent Trigrams")Prediction Algorithm
Approach
The N-Grams calculated above can then be used in a next-word text prediction algorithm, which will ultimately be hosted within a Shiny web app. With this in mind, the speed of the algorithm and its processing is paramount.
It is for this reason that the initial corpus will need to be reduced before building the text prediction function. For the first algorithm, a sample of 5,000 lines from each of the three sources will be used to create a much slimmer corpus.
The downside of this is reduced prediction accuracy, but the gain is speed of processing and prediction.
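A sketch of the sampling step is shown below, assuming a simple random sample of whole lines from each source; the seed is an arbitrary illustrative choice.

library(quanteda)

set.seed(1234)                      # arbitrary seed for reproducibility
sample_size <- 5000

twitter_sample <- sample(twitter_data, sample_size)
news_sample    <- sample(news_data, sample_size)
blogs_sample   <- sample(blogs_data, sample_size)

# Combine the samples into one slimmed-down corpus for the prediction model
slim_corpus <- corpus(c(twitter_sample, news_sample, blogs_sample))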
Data Input
The slimmed-down corpus is transformed into unigrams through 5-Grams via the above method of tokenisation, calculating a document-feature matrix and then an N-Gram frequency table for each of the five N-Gram orders.
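A sketch of how each frequency table could be built is shown below, with each N-Gram split into a prefix (the first n-1 words) and the predicted next word. The function and column names are illustrative, not the project's exact code.

library(quanteda)
library(data.table)

# Tokenise the slimmed-down corpus (same options as in the earlier pipeline sketch)
sample_tokens <- tokens_tolower(tokens(slim_corpus,
                                       remove_numbers = TRUE, remove_punct = TRUE,
                                       remove_separators = TRUE, split_hyphens = TRUE))

# Build an N-Gram frequency table for a single order n
build_ngram_table <- function(toks, n) {
  ngram_dfm <- dfm(tokens_ngrams(toks, n = n, concatenator = " "))
  freq <- data.table(ngram = featnames(ngram_dfm),
                     count = colSums(ngram_dfm))
  if (n > 1) {
    freq[, prefix    := sub(" \\S+$", "", ngram)]   # first n-1 words
    freq[, next_word := sub("^.* ", "", ngram)]     # last word
  }
  setorder(freq, -count)
  freq[]
}

# Frequency tables for orders 1 to 5
ngram_tables <- lapply(1:5, function(n) build_ngram_table(sample_tokens, n))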
Algorithm
Simple Frequency
There are many approaches to predicting the next word given an input string from a user. The simplest is to match the last n-1 words of the input against the first n-1 words of each N-Gram. For example, if the user inputs “Once upon a time”, we would search for those four words as the prefix in our 5-Gram frequency table, find the most frequent matching 5-Gram and suggest its final word as the completion of the sentence.
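A sketch of this lookup against the illustrative 5-Gram table from the previous section is shown below; the column names follow that sketch and are assumptions.

library(data.table)

# Sketch: return the most frequent completion of the last four words of the input
predict_simple <- function(input, five_gram_table) {
  words        <- strsplit(tolower(trimws(input)), "\\s+")[[1]]  # in practice, reuse the quanteda tokenisation
  input_prefix <- paste(tail(words, 4), collapse = " ")          # last four words of the input
  matches      <- five_gram_table[prefix == input_prefix]
  if (nrow(matches) == 0) return(NA_character_)                  # no matching 5-Gram found
  matches[which.max(count), next_word]                           # most frequent completion
}

# predict_simple("Once upon a time", ngram_tables[[5]])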
This simple approach falls down when there are no 5-Grams which match the search term. It is for this reason that the initial approach implemented is a Stupid Backoff model.
Stupid Backoff
In this model, if no matching 5-Grams are found, the algorithm backs off to the 4-Gram frequency table for the most popular match, then to 3-Grams, 2-Grams and unigrams respectively.
The Stupid Backoff model also applies a lambda discounting constant to down-weight the scores taken from the lower order N-Grams. In the case where no N-Gram matches are found at all, the most popular unigrams are suggested.
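A sketch of the backoff logic over the illustrative tables above is shown below; the lambda value of 0.4 is the discount commonly quoted for Stupid Backoff and is an assumption here rather than a tuned parameter.

library(data.table)

# Sketch of Stupid Backoff: try the highest-order table first, discounting
# scores by lambda for each level backed off; fall back to unigrams if nothing matches
predict_backoff <- function(input, ngram_tables, lambda = 0.4) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  for (n in 5:2) {
    if (length(words) < n - 1) next                 # not enough context for this order
    input_prefix <- paste(tail(words, n - 1), collapse = " ")
    matches <- ngram_tables[[n]][prefix == input_prefix]
    if (nrow(matches) > 0) {
      # score = discounted relative count; higher-order matches are preferred
      score <- lambda^(5 - n) * matches$count / sum(matches$count)
      return(matches$next_word[which.max(score)])
    }
  }
  ngram_tables[[1]][which.max(count), ngram]        # no match at any order: most frequent unigram
}

# predict_backoff("once upon a", ngram_tables)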
All of the above approaches reduce computational complexity by using the Markov assumption, so that probabilities are calculated only from the preceding n-1 words rather than from all previous words in the input string. This vastly reduces the complexity of the prediction algorithms and improves speed.
Other Approaches
Other options for text prediction algorithms are the Katz Backoff model, which implements a more accurate but more complex version of the discounting used in the Stupid Backoff model and relies on Good-Turing smoothing, or alternatively Kneser-Ney smoothing, which again is more complex but more accurate. Both of these approaches estimate smoothed probability distributions rather than working directly with counts.
However, given the large dataset and the fact that Stupid Backoff works with counts rather than smoothed probabilities, it has been implemented in the initial model.
Next Steps
The next steps after building the initial prediction model are to:
1. Improve the speed of the algorithm by improving code efficiency
2. Improve accuracy by considering an initial corpus of the most frequent N-Grams rather than a random sample
3. Build a Shiny app user interface to interact with the prediction model
4. Test the accuracy of the final model