Executive Summary

This report briefly summarizes my work to date toward a natural language processing application that predicts the next word to be typed, given the last word or phrase typed. Four items are addressed in this report:

  1. Downloading the SwiftKey data and loading it into R
  2. Summaries of the data sets
  3. Interesting findings thus far
  4. Plans for creating the prediction algorithm and Shiny app

Loading the data

I read in each of the English text files and sample each one to create a smaller data set, which speeds up processing during initial exploration and algorithm construction. These samples are then loaded into a single corpus using the quanteda package.

library(quanteda)
library(readtext)
library(tidyverse)
library(ggplot2)
library(gridExtra)

setwd("C:/Users/207014104/Desktop/Box Sync/Personal/DataScience/capstone/Coursera-SwiftKey/final/en_US")

con.t <- file("en_US.twitter.txt", "r") 
con.n <- file("en_US.news.txt", "r")
con.b <- file("en_US.blogs.txt", "r")

# divisor used to reduce the number of lines sampled from each file
redx <- 20

set.seed(17)

# sample 1/redx of the Twitter lines and write the sample to disk
t <- readLines(con.t, skipNul = TRUE)
size.t <- length(t) / redx 
twitter <- sample(t, size.t)
writeLines(twitter, con = "twitter-sized.txt")
close(con.t)

n <- readLines(con.n, skipNul = TRUE)
size.n <- length(n) / redx * 4 # take more news since it's smaller
news <- sample(n, size.n)
writeLines(news, con = "news-sized.txt")
close(con.n)

b <- readLines(con.b, skipNul = TRUE)
size.b <- length(b) / redx
blogs <- sample(b, size.b)
writeLines(blogs, con = "blogs-sized.txt")
close(con.b)

corp.all <- corpus(c(twitter, news, blogs))

rm(t, n, b, twitter, news, blogs)

# a smaller corpus combining the samples
corp <- corpus(readtext("*sized.txt"))

Summary of the data

Prior to cleaning the data, here are the raw statistics of the sampled corpus:

summary(corp)
## Corpus consisting of 3 documents:
## 
##               Text Types  Tokens Sentences
##    blogs-sized.txt 95347 2205460    100502
##     news-sized.txt 49078  623832     28071
##  twitter-sized.txt 93437 1845722    128409
## 
## Source: C:/Users/207014104/Desktop/Box Sync/Personal/DataScience/capstone/Coursera-SwiftKey/final/en_US/* on x86-64 by 207014104
## Created: Fri Jul 05 21:59:02 2019
## Notes:

Next, the data is cleaned and word-level unigrams, bigrams and trigrams are created. Cleaning consists of removing numbers, symbols, punctuation, URLs, Twitter handles, hyphens and profanity. To keep this report concise, the cleaning and plotting code is hidden, but a rough sketch of the approach follows.
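
For reference, here is a minimal sketch of that cleaning and n-gram step using quanteda. It is not the hidden code itself: the profanity object is assumed to be a character vector of words to filter (not shown), and the Twitter-handle and hyphen handling is omitted.

# tokenize and clean: drop numbers, punctuation, symbols and URLs,
# then lower-case and remove the profanity word list
toks <- tokens(corp, remove_numbers = TRUE, remove_punct = TRUE,
               remove_symbols = TRUE, remove_url = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, pattern = profanity)

# build unigram, bigram and trigram document-feature matrices
dfm.uni <- dfm(toks)
dfm.bi  <- dfm(tokens_ngrams(toks, n = 2))
dfm.tri <- dfm(tokens_ngrams(toks, n = 3))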

Here is a histogram of the number of appearances of individual words. It is apparent that most words appear fewer than 25 times, and the vast majority of unique words appear only once or twice.
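
Since the plotting code is hidden, the following is only a sketch of how such a histogram could be produced from the dfm.uni matrix sketched above.

# total appearances of each unique word across the sample
word.freq <- colSums(dfm.uni)

# histogram of appearance counts, truncated to the lower end of the range
ggplot(data.frame(freq = word.freq), aes(x = freq)) +
  geom_histogram(binwidth = 1) +
  xlim(0, 100) +
  labs(x = "Number of appearances", y = "Count of unique words")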

However, there is a small subset of words that appear very often and account for a very large portion of the overall corpus. Here are the top 10 for each of the sampled documents and their number of appearances:

The top 10 words in each document appear very frequently, and in a similar order with only a few small variations. The most frequent words are not surprising, nor especially interesting.
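
As an illustration only, these per-document counts could be pulled from the unigram matrix sketched earlier with quanteda's topfeatures():

# top 10 words and their counts in each sampled document
topfeatures(dfm.uni["twitter-sized.txt", ], 10)
topfeatures(dfm.uni["news-sized.txt", ], 10)
topfeatures(dfm.uni["blogs-sized.txt", ], 10)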

Interesting findings

The top 10 bigrams also appear very frequently, although their totals are somewhat more evenly distributed than those of the top unigrams. Notably, 7 of the top 10 bigrams have “the” as their second word.

Finally, I provide two plots that identify how many unique words are needed to cover 50% and 90% of all unigram and bigram occurrences in the sample.
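
The coverage calculation behind those plots can be sketched as a cumulative sum over the sorted frequencies; it is shown here for unigrams using the dfm.uni matrix assumed above, with the same approach applying to bigrams.

# sort word frequencies from most to least common and compute the
# running share of all word occurrences covered
freq.sorted <- sort(colSums(dfm.uni), decreasing = TRUE)
coverage <- cumsum(freq.sorted) / sum(freq.sorted)

# number of unique words needed to cover 50% and 90% of occurrences
min(which(coverage >= 0.5))
min(which(coverage >= 0.9))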

Plans for algorithm and app

Going forward, I plan to build an algorithm and create a Shiny app that predicts the next word based on a user-provided word or phrase. Two key tradeoffs to consider are processing time and total coverage.

Given the above analysis, it appears that the dictionary from which unigrams and bigrams are retrieved can be substantially reduced, rather than including all words in the sample, while still maintaining a high level of coverage.
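
One possible way to prune that dictionary, sketched here with quanteda's dfm_trim() and an arbitrary cutoff that would need to be tuned against coverage:

# drop features seen fewer than 5 times in the sample (cutoff is illustrative)
dfm.uni.trimmed <- dfm_trim(dfm.uni, min_termfreq = 5)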

In order to achieve the highest level of coverage, my algorithm will consider the previous 3 words and return the most frequent last word of a four-gram that begins with those three words. If there is no history for those three words, it will back off to 2 words and consider the most frequent trigram starting with those 2. If there are still no matches, it will consider the 2nd word of a bigram. I look forward to tuning this model for performance and accuracy and building the application. A rough sketch of this backoff lookup follows.
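
The sketch below illustrates that backoff logic only; it is not the final implementation. It assumes hypothetical lookup tables four.grams, tri.grams and bi.grams, each a named character vector mapping a prefix (words joined by "_") to its most frequent next word.

predict.next <- function(phrase, four.grams, tri.grams, bi.grams) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 3)
  if (length(words) == 0) return(NA_character_)

  # try the longest available prefix first, then back off
  if (length(words) >= 3) {
    hit <- four.grams[paste(words, collapse = "_")]
    if (!is.na(hit)) return(unname(hit))
  }
  if (length(words) >= 2) {
    hit <- tri.grams[paste(tail(words, 2), collapse = "_")]
    if (!is.na(hit)) return(unname(hit))
  }
  hit <- bi.grams[tail(words, 1)]
  if (!is.na(hit)) return(unname(hit))
  NA_character_
}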