Create the data folder, download and unzip the dataset, then load it using a VCorpus from the tm library.
if (!dir.exists("data")) {
dir.create("data", showWarnings = FALSE)
}
if (!file.exists("data/swiftkey.zip")) {
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", "data/swiftkey.zip")
}
if (!dir.exists("data/final")) {
unzip("data/swiftkey.zip", exdir="data")
}
Use bash commands to generate the word/line/character counts, since this is much faster than computing them in R.
mkdir -p tmp
rm -f tmp/word_count.csv tmp/char_count.csv tmp/line_count.csv

# Append one row per document (blogs, news, twitter) to each count file
for doc in blogs news twitter; do
  wc -w "data/final/en_US/en_US.${doc}.txt" | awk '{print $1}' >> tmp/word_count.csv
  wc -c "data/final/en_US/en_US.${doc}.txt" | awk '{print $1}' >> tmp/char_count.csv
  wc -l "data/final/en_US/en_US.${doc}.txt" | awk '{print $1}' >> tmp/line_count.csv
done
# Read the CSV files into R data frames
word_count_df <- read.csv("tmp/word_count.csv", header = FALSE, col.names = "WordCount")
char_count_df <- read.csv("tmp/char_count.csv", header = FALSE, col.names = "CharCount")
line_count_df <- read.csv("tmp/line_count.csv", header = FALSE, col.names = "LineCount")

# Combine into one table, one row per document type
counts <- cbind(word_count_df, char_count_df, line_count_df)
counts$DocumentType <- c("blogs", "news", "twitter")
long_counts <- gather(counts, key = "Variable", value = "Count", -DocumentType)
ggplot(long_counts, aes(x = DocumentType, y = Count, fill = Variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ Variable, scales = "free_y") +
  labs(title = "Counts for Document Type", y = "Count") +
  theme(legend.position = "none")
corpus <- VCorpus(DirSource("data/final/en_US/", encoding = "UTF-8"),
                  readerControl = list(reader = readPlain))
summary(corpus)
## Length Class Mode
## en_US.blogs.txt 2 PlainTextDocument list
## en_US.news.txt 2 PlainTextDocument list
## en_US.twitter.txt 2 PlainTextDocument list
This loads all the text into memory, about 872 MB. That is manageable for now, but we should be careful not to needlessly duplicate or copy the data.
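As a sanity check on that figure, the in-memory size of the corpus object created above can be inspected directly; a quick sketch:

# Report the size of the corpus object in megabytes
print(object.size(corpus), units = "MB")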
# Basic cleaning: lower-case, then strip punctuation, numbers and extra whitespace
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
I am choosing not to remove stop words, because the end goal is to predict the next word a user wants to type, and that next word will often be a stop word.
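For illustration, a minimal sketch of what would have been discarded: tm's built-in English stop-word list consists of exactly the kind of common words a keyboard should be able to predict.

# Peek at tm's English stop-word list: pronouns, articles, prepositions, etc.
head(stopwords("en"), 15)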
We can see that the twitter document has by far the most lines, but fewer words than the other datasets; intuitively this is because of the strict character limit on tweets.
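As a quick check of that intuition, the average number of words per line can be computed from the counts table built above (a small sketch using the counts data frame from the earlier chunk):

# Average words per line for each document type; tweets should be much shorter
counts$WordsPerLine <- counts$WordCount / counts$LineCount
counts[, c("DocumentType", "WordsPerLine")]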
Create a term-document matrix showing the frequency of each word in each document. Stop words are kept, consistent with the choice above. Stemming (so that "cats" and "cat" would be counted as the same term) is not applied here either, although it could be; see the sketch after the frequency plot.
# Term-document matrix: rows are terms, columns are the three documents
tdm <- TermDocumentMatrix(corpus)

# Total frequency of each word across all documents, sorted descending
word_freq <- rowSums(as.matrix(tdm))
word_freq <- data.frame(word = names(word_freq), freq = word_freq)
word_freq <- word_freq[order(word_freq$freq, decreasing = TRUE), ]
ggplot(head(word_freq, 25), aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity")
I will first tokenize the text into sentences, then randomly sample from these sentences to form training, validation, and test sets.
# tokenize_sentences() comes from the tokenizers package
corpus_sentences <- tm_map(corpus, content_transformer(tokenize_sentences))
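The sampling step itself is not shown yet; below is a minimal sketch of how the split could be done, assuming the tokenized sentences are flattened into a single character vector and an 80/10/10 split is used (the object names and proportions are illustrative, not final):

# Flatten the tokenized documents into one vector of sentences
all_sentences <- unlist(lapply(corpus_sentences, content))

# Reproducibly assign each sentence to train / validation / test
set.seed(1234)
split <- sample(c("train", "validation", "test"),
                size = length(all_sentences),
                replace = TRUE,
                prob = c(0.8, 0.1, 0.1))

train_set      <- all_sentences[split == "train"]
validation_set <- all_sentences[split == "validation"]
test_set       <- all_sentences[split == "test"]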