In this report, I detail the process of downloading and loading the data from the provided text files. Next, summary statistics are reported to gain an overall understanding of the corpus. Lastly, a basic outline of my plan for the prediction model is laid out. I hope to receive plenty of feedback on where the plan needs improvement, especially if there are implementation issues.
The following code downloads the three text files in the form of a zip, and unzips them into the working directory.
fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileURL, "./capstone.zip")
unzip("./capstone.zip")
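A small optional guard, not part of the original code, avoids re-downloading and re-extracting the large archive on repeated runs:
# only download and unzip if the archive is not already present
if(!file.exists("./capstone.zip")){
download.file(fileURL, "./capstone.zip")
unzip("./capstone.zip")
}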
Next, each text file is read and the number of lines is output.
setwd("~/R Data Science/Coursera-Notes/10 Capstone/final/en_US")
con <- file("./en_US.blogs.txt", "r")
blogs <- readLines(con)
length(blogs)
## [1] 899288
close(con)
con <- file("./en_US.news.txt", "r")
news <- readLines(con)
length(news)
## [1] 77259
close(con)
con <- file("./en_US.twitter.txt", "r")
twitter <- readLines(con)
length(twitter)
## [1] 2360148
close(con)
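The line count for the news file is much lower than for the other two sources. One possible cause, which I have not verified here, is that reading in text mode stops early when the file contains control characters. A more defensive read, sketched below with base R arguments only, opens the connection in binary mode and skips embedded nul bytes:
# binary mode ("rb") avoids an early end-of-file on stray control characters;
# skipNul = TRUE drops embedded nul bytes
con <- file("./en_US.news.txt", "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)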
As we can see, the text files contain hundreds of thousands, if not millions, of lines. Unfortunately, my computer has only 4 GB of RAM and cannot handle model creation with such large inputs. To resolve this issue, I subsampled the text files, keeping only 12,500 lines per file.
set.seed(8675309)
blogsfull <- sample(blogs, 12500)
newsfull <- sample(news, 12500)
twitterfull <- sample(twitter, 12500)
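As a quick sanity check, not part of the original analysis, object.size() confirms that the 12,500-line subsamples are small enough to handle comfortably in memory:
# approximate in-memory size of each subsample
format(object.size(blogsfull), units = "MB")
format(object.size(newsfull), units = "MB")
format(object.size(twitterfull), units = "MB")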
The next step is to divide the subsampled text files into training and testing sets. I used p = 0.8, so each training set contains 10,000 lines (and each test set 2,500).
set.seed(8675309)
inTrain <- sort(sample(1:12500, 10000))
blogs <- blogsfull[inTrain]
blogstest <- blogsfull[-inTrain]
inTrain <- sort(sample(1:12500, 10000))
news <- newsfull[inTrain]
newstest <- newsfull[-inTrain]
inTrain <- sort(sample(1:12500, 10000))
twitter <- twitterfull[inTrain]
twittertest <- twitterfull[-inTrain]
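For reference, the same 80/20 split could be expressed as a small helper applied to each source. The name split80 is a placeholder of my own, shown only as a more compact equivalent of the code above:
# hypothetical helper: split a character vector into 80% training / 20% testing pieces
split80 <- function(x, p = 0.8){
  idx <- sort(sample(seq_along(x), floor(p * length(x))))
  list(train = x[idx], test = x[-idx])
}
# usage: parts <- split80(blogsfull); blogs <- parts$train; blogstest <- parts$test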
To complete the rest of the exploratory analysis, I used the following libraries.
library(tokenizers)
library(ngram)
library(ggplot2)
The following code tokenizes the text files, separating them into individual words. The word counts for each subsample are provided as output.
blogwords <- tokenize_words(blogs)
newswords <- tokenize_words(news)
twitterwords <- tokenize_tweets(twitter)
length(unlist(blogwords))
## [1] 421314
length(unlist(newswords))
## [1] 349941
length(unlist(twitterwords))
## [1] 125307
all <- c(blogwords, newswords, twitterwords)
all <- unlist(all)
length(all)
## [1] 896562
Throughout the text, there are stray characters interspersed that do not make sense in context. Here are some examples.
head(all[grep("â|œ", all)])
## [1] "vcâ" "donâ" "itâ" "blumaâ" "â" "carlosâ"
These characters might have made sense if the text files were not in English, but since we are working with English-language processing, the extraneous characters can be removed. The following code removes them by replacing each match with an empty string.
all <- gsub("â|œ", "", all)
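These stray characters are typical of an encoding mismatch, where UTF-8 text has been read under a different locale. An alternative clean-up, not applied here, would drop every non-ASCII byte in a single pass with iconv(); the name all_ascii is a placeholder of my own:
# alternative: convert to ASCII and silently drop any byte that cannot be represented
all_ascii <- iconv(all, from = "", to = "ASCII", sub = "")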
First, a summary of the unigram frequencies in the combined training data is shown below.
table <- table(all)
table <- table[order(table, decreasing = TRUE)]
table[1:50]
## all
## the to and a of in i that for is it on you
## 43910 24080 22795 21500 18861 14803 14008 9633 9291 9013 8339 7012 6722
## with was at this my be as have but he are we
## 6503 5788 4853 4752 4700 4663 4611 4586 4293 4210 4176 4062 3848
## not from so they his by said all or an will me one
## 3611 3480 3137 2998 2980 2953 2889 2863 2827 2759 2753 2647 2593
## s up about out has when what who if had just
## 2543 2530 2523 2438 2314 2299 2294 2261 2247 2236 2235
barplot(table[1:30], main = "Word Frequency", xlab = "Word", las = 2)
Next, I created a function to determine how many unique words are needed to cover a given proportion of the total words in the corpus. The first value output is the number of unique words needed to cover 50 percent of all words in the corpus; the second value is the same concept with 90 percent coverage.
total <- sum(table)
prop <- table/total
instances <- function(percent){
  # returns the number of top-ranked unique words needed to reach `percent` coverage
  sum <- 0
  i <- 1
  while(sum < percent){
    sum <- sum + prop[i]
    i <- i + 1
  }
  i - 1
}
instances(.5)
## [1] 151
instances(.9)
## [1] 7711
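Equivalently, and as a cross-check rather than part of the original analysis, the same coverage ranks can be computed without a loop using cumsum():
# first rank at which the cumulative proportion reaches the coverage target
min(which(cumsum(prop) >= 0.5))
min(which(cumsum(prop) >= 0.9))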
The next step in the process is to examine the distribution of n-grams. The frequencies of 2-grams and 3-grams are summarized in the following tables and bar plots.
x <- concatenate(blogs, news, twitter)
invisible(preprocess(x))
two <- ngram(x, n = 2)
twofreq <- head(get.phrasetable(two), 20)
twofreq
## ngrams freq prop
## 1 of the 4029 0.0045620014
## 2 in the 3452 0.0039086694
## 3 to the 1858 0.0021037971
## 4 on the 1640 0.0018569576
## 5 for the 1544 0.0017482577
## 6 to be 1352 0.0015308578
## 7 at the 1165 0.0013191193
## 8 and the 1142 0.0012930766
## 9 in a 996 0.0011277621
## 10 with the 957 0.0010836027
## 11 is a 867 0.0009816965
## 12 for a 763 0.0008639382
## 13 with a 761 0.0008616736
## 14 of a 754 0.0008537476
## 15 from the 749 0.0008480861
## 16 I was 733 0.0008299695
## 17 will be 675 0.0007642966
## 18 I have 661 0.0007484445
## 19 to get 609 0.0006895654
## 20 is the 600 0.0006793747
three <- ngram(x, n = 3)
threefreq <- head(get.phrasetable(three), 20)
threefreq
## ngrams freq prop
## 1 one of the 268 3.034544e-04
## 2 a lot of 261 2.955284e-04
## 3 to be a 161 1.822991e-04
## 4 out of the 154 1.743730e-04
## 5 be able to 129 1.460657e-04
## 6 the end of 124 1.404043e-04
## 7 as well as 124 1.404043e-04
## 8 going to be 123 1.392720e-04
## 9 some of the 110 1.245522e-04
## 10 the rest of 101 1.143615e-04
## 11 part of the 99 1.120970e-04
## 12 I want to 94 1.064355e-04
## 13 a couple of 94 1.064355e-04
## 14 I have to 88 9.964174e-05
## 15 end of the 81 9.171569e-05
## 16 you want to 75 8.492194e-05
## 17 I have a 74 8.378965e-05
## 18 is going to 73 8.265735e-05
## 19 it was a 73 8.265735e-05
## 20 is one of 73 8.265735e-05
par(mar = c(8, 4.1, 4.1, 2.1))
barplot(twofreq$freq, names.arg = twofreq$ngrams,
main = "2-gram Frequency", las = 2)
barplot(threefreq$freq, names.arg = threefreq$ngrams,
main = "3-gram Frequency", las = 2)
I am still uncertain about exactly how I will implement the model described below, so any feedback would be greatly appreciated.
To predict the next word from a sequence of 1-gram, 2-gram, and 3-gram inputs, I plan to use Katz's back-off model, with Good-Turing estimation as the smoothing method.
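For context, Good-Turing smoothing replaces each raw count c with an adjusted count (c + 1) * N(c + 1) / N(c), where N(c) is the number of n-grams observed exactly c times. As a starting point only, the sketch below shows the back-off lookup itself using the raw maximum-likelihood counts already computed above and no discounting; the function predict_next and its structure are placeholders of my own, not the final implementation, and it reuses the two, three, and table objects created earlier.
# simplified back-off lookup (no Good-Turing discounting yet): try the trigram
# table, then the bigram table, then fall back to the most frequent unigram
bi <- get.phrasetable(two)
tri <- get.phrasetable(three)
bi$ngrams <- trimws(bi$ngrams)    # drop any trailing whitespace in the phrase strings
tri$ngrams <- trimws(tri$ngrams)

predict_next <- function(phrase){
  words <- unlist(strsplit(trimws(phrase), "\\s+"))
  n <- length(words)
  if(n >= 2){
    # trigrams whose first two words match the last two words of the input
    prefix <- paste(words[n - 1], words[n], "")
    hits <- tri[startsWith(tri$ngrams, prefix), ]
    # phrase tables are already sorted by frequency, so the first match is the most likely
    if(nrow(hits) > 0) return(strsplit(hits$ngrams[1], " ")[[1]][3])
  }
  # back off to bigrams whose first word matches the last word of the input
  prefix <- paste(words[n], "")
  hits <- bi[startsWith(bi$ngrams, prefix), ]
  if(nrow(hits) > 0) return(strsplit(hits$ngrams[1], " ")[[1]][2])
  # final back-off: the single most frequent unigram
  names(table)[1]
}

predict_next("one of")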