Introduction

In this report, I detail the process of downloading and reading in the data from the provided text files. Next, summary statistics are reported to gain an overall understanding of the corpus. Lastly, a basic outline of my plan for the prediction model is laid out. I hope to receive plenty of feedback on where the plan needs improvement, especially if there are implementation issues.

Downloading and Reading the Data

The following code downloads the dataset as a zip file and unzips it into the working directory; the three English text files used below are located in the final/en_US folder after extraction.

fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileURL, "./capstone.zip")
unzip("./capstone.zip")

Next, each text file is read and the number of lines is output.

setwd("~/R Data Science/Coursera-Notes/10 Capstone/final/en_US")
con <- file("./en_US.blogs.txt", "r")
blogs <- readLines(con)
length(blogs)
## [1] 899288
close(con)

con <- file("./en_US.news.txt", "r")
news <- readLines(con)
length(news)
## [1] 77259
close(con)

con <- file("./en_US.twitter.txt", "r")
twitter <- readLines(con)
length(twitter)
## [1] 2360148
close(con)

As we can see, the text files contain from hundreds of thousands to millions of lines each. Unfortunately, my computer has only 4GB of RAM and cannot handle model building with inputs this large. To resolve this issue, I subsampled the text files, keeping only 12,500 lines per file.

set.seed(8675309)

# draw a reproducible 12,500-line subsample from each source
blogsfull <- sample(blogs, 12500)
newsfull <- sample(news, 12500)
twitterfull <- sample(twitter, 12500)

The next step is to divide the subsampled text into training and test sets. I used an 80/20 split (p = 0.8), so each training set contains 10,000 lines and each test set contains 2,500 lines.

set.seed(8675309)

# for each source, randomly select 10,000 of the 12,500 lines for training;
# the remaining 2,500 lines form the test set
inTrain <- sort(sample(1:12500, 10000))
blogs <- blogsfull[inTrain]
blogstest <- blogsfull[-inTrain]

inTrain <- sort(sample(1:12500, 10000))
news <- newsfull[inTrain]
newstest <- newsfull[-inTrain]

inTrain <- sort(sample(1:12500, 10000))
twitter <- twitterfull[inTrain]
twittertest <- twitterfull[-inTrain]

Tokenization

To complete the rest of the exploratory analysis, I used the following libraries.

library(tokenizers)
library(ngram)
library(ggplot2)

The following code tokenizes the text files, separating them into individual words; the Twitter subsample is tokenized with tokenize_tweets(), which is designed to handle Twitter-specific features such as hashtags and usernames. The word counts for each subsample are provided as output.

blogwords <- tokenize_words(blogs)
newswords <- tokenize_words(news)
twitterwords <- tokenize_tweets(twitter)

length(unlist(blogwords))
## [1] 421314
length(unlist(newswords))
## [1] 349941
length(unlist(twitterwords))
## [1] 125307
all <- c(blogwords, newswords, twitterwords) 
all <- unlist(all)

length(all)
## [1] 896562

Throughout the text there are stray characters interspersed that do not make sense in context. Here are some examples.

head(all[grep("â|œ", all)])
## [1] "vcâ"     "donâ"    "itâ"     "blumaâ"  "â"       "carlosâ"

These characters are most likely encoding artifacts (for example, curly quotes or accented letters read in with the wrong encoding), and they might carry meaning in a non-English text. Since we are working with English language processing, the extraneous characters can simply be removed. The following code replaces them with the empty string.

# gsub is vectorized, so the whole word vector can be cleaned in one call
all <- gsub("â|œ", "", all)

Summary Statistics

First, a summary of the frequency of unigrams in the training set is found below.

# unigram frequency table, sorted from most to least frequent
wordfreq <- table(all)
wordfreq <- wordfreq[order(wordfreq, decreasing = TRUE)]
wordfreq[1:50]
## all
##   the    to   and     a    of    in     i  that   for    is    it    on   you 
## 43910 24080 22795 21500 18861 14803 14008  9633  9291  9013  8339  7012  6722 
##  with   was    at  this    my    be    as  have   but    he   are          we 
##  6503  5788  4853  4752  4700  4663  4611  4586  4293  4210  4176  4062  3848 
##   not  from    so  they   his    by  said   all    or    an  will    me   one 
##  3611  3480  3137  2998  2980  2953  2889  2863  2827  2759  2753  2647  2593 
##     s    up about   out   has  when  what   who    if   had  just 
##  2543  2530  2523  2438  2314  2299  2294  2261  2247  2236  2235
barplot(wordfreq[1:30], main = "Word Frequency", xlab = "Word", las = 2)

Next, I created a function to determine how many unique words are needed to cover a given proportion of all word instances in the corpus. The first value output is the number of unique words needed to cover 50 percent of all words in the corpus; the second is the same calculation with 90 percent coverage.

total <- sum(wordfreq)
prop <- wordfreq/total

# number of unique words (taken from most to least frequent) needed to cover
# the given proportion of all word instances; relies on prop being sorted in
# decreasing order
instances <- function(percent){
        covered <- 0
        i <- 1
        while(covered < percent){
                covered <- covered + prop[i]
                i <- i + 1
        }
        i - 1
}

instances(.5)
## [1] 151
instances(.9)
## [1] 7711
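
For reference, because prop is already sorted in decreasing order, the same counts can be computed without a loop; the cumsum() call below is just an equivalent vectorized form of the function above.

coverage <- cumsum(prop)
# first position at which cumulative coverage reaches the target proportion
unname(which(coverage >= .5)[1])
unname(which(coverage >= .9)[1])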

The next step in the process is to examine the distribution of n-grams. The most frequent 2-grams and 3-grams are summarized in the following tables and bar plots.

x <- concatenate(blogs, news, twitter)
invisible(preprocess(x))

two <- ngram(x, n = 2)
twofreq <- head(get.phrasetable(two), 20)
twofreq
##       ngrams freq         prop
## 1    of the  4029 0.0045620014
## 2    in the  3452 0.0039086694
## 3    to the  1858 0.0021037971
## 4    on the  1640 0.0018569576
## 5   for the  1544 0.0017482577
## 6     to be  1352 0.0015308578
## 7    at the  1165 0.0013191193
## 8   and the  1142 0.0012930766
## 9      in a   996 0.0011277621
## 10 with the   957 0.0010836027
## 11     is a   867 0.0009816965
## 12    for a   763 0.0008639382
## 13   with a   761 0.0008616736
## 14     of a   754 0.0008537476
## 15 from the   749 0.0008480861
## 16    I was   733 0.0008299695
## 17  will be   675 0.0007642966
## 18   I have   661 0.0007484445
## 19   to get   609 0.0006895654
## 20   is the   600 0.0006793747
three <- ngram(x, n = 3)
threefreq <- head(get.phrasetable(three), 20)
threefreq
##          ngrams freq         prop
## 1   one of the   268 3.034544e-04
## 2     a lot of   261 2.955284e-04
## 3      to be a   161 1.822991e-04
## 4   out of the   154 1.743730e-04
## 5   be able to   129 1.460657e-04
## 6   the end of   124 1.404043e-04
## 7   as well as   124 1.404043e-04
## 8  going to be   123 1.392720e-04
## 9  some of the   110 1.245522e-04
## 10 the rest of   101 1.143615e-04
## 11 part of the    99 1.120970e-04
## 12   I want to    94 1.064355e-04
## 13 a couple of    94 1.064355e-04
## 14   I have to    88 9.964174e-05
## 15  end of the    81 9.171569e-05
## 16 you want to    75 8.492194e-05
## 17    I have a    74 8.378965e-05
## 18 is going to    73 8.265735e-05
## 19    it was a    73 8.265735e-05
## 20   is one of    73 8.265735e-05
par(mar = c(8, 4.1, 4.1, 2.1))
barplot(twofreq$freq, names.arg = twofreq$ngrams, 
        main = "2-gram Frequency", las = 2)

barplot(threefreq$freq, names.arg = threefreq$ngrams,
        main = "3-gram Frequency", las = 2)

Prediction Blueprint

I am still uncertain about exactly how I will implement the following model, so any feedback would be greatly appreciated.

To predict the next word from a sequence of 1-gram, 2-gram, and 3-gram inputs, I plan to use Katz's back-off model, with Good-Turing estimation as the smoothing method.
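
To make the plan concrete, here is a minimal sketch of the back-off step only, with no Good-Turing discounting yet: it simply returns the most frequent continuation, backing off from the 3-gram table to the 2-gram table and finally to the most frequent unigram. The names predict_next, top_continuation, bi_table, and tri_table are placeholders rather than the final implementation, and the tables are assumed to be full phrasetables built with get.phrasetable() from the two and three objects above.

# full phrasetables (not just the first 20 rows shown above)
bi_table <- get.phrasetable(two)
tri_table <- get.phrasetable(three)

# most frequent continuation of `prefix` in a phrasetable, or NA if the
# prefix never occurs; phrasetables are already sorted by frequency
top_continuation <- function(prefix, phrasetable) {
        ng <- as.character(phrasetable$ngrams)
        hit <- ng[startsWith(ng, prefix)][1]
        if (is.na(hit)) return(NA_character_)
        trimws(sub(prefix, "", hit, fixed = TRUE))
}

# simplified back-off: try the last two words in the 3-gram table, then the
# last word in the 2-gram table, then fall back to the top unigram; matching
# is case-sensitive here because the corpus was not lowercased
predict_next <- function(phrase, tri_table, bi_table) {
        words <- unlist(strsplit(trimws(phrase), "\\s+"))
        n <- length(words)
        guess <- NA_character_
        if (n >= 2) {
                guess <- top_continuation(paste(words[n - 1], words[n], ""), tri_table)
        }
        if (is.na(guess)) {
                guess <- top_continuation(paste(words[n], ""), bi_table)
        }
        if (is.na(guess)) {
                guess <- names(wordfreq)[1]   # most frequent unigram overall
        }
        guess
}

predict_next("one of", tri_table, bi_table)   # sample call

In the final model, the raw frequencies used above would be replaced with Good-Turing discounted counts, and the left-over probability mass would be redistributed to the lower-order n-grams as Katz's method prescribes.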