Create the data folder, download and unzip the dataset, then load it using a VCorpus from the tm library.
if (!dir.exists("data")) {
dir.create("data", showWarnings = FALSE)
}
if (!file.exists("data/swiftkey.zip")) {
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", "data/swiftkey.zip")
}
if (!dir.exists("data/final")) {
unzip("data/swiftkey.zip", exdir="data")
}
Use bash commands to generate the word/line/character counts, since this is much faster than computing them in R.
mkdir -p tmp
rm -f tmp/word_count.csv tmp/char_count.csv tmp/line_count.csv

# Append one row per document (blogs, news, twitter) to each count file
for doc in blogs news twitter; do
  wc -w "data/final/en_US/en_US.${doc}.txt" | awk '{print $1}' >> tmp/word_count.csv
  wc -c "data/final/en_US/en_US.${doc}.txt" | awk '{print $1}' >> tmp/char_count.csv
  wc -l "data/final/en_US/en_US.${doc}.txt" | awk '{print $1}' >> tmp/line_count.csv
done
# Read the CSV files into R data frames
word_count_df <- read.csv("tmp/word_count.csv", header = FALSE, col.names = "WordCount")
char_count_df <- read.csv("tmp/char_count.csv", header = FALSE, col.names = "CharCount")
line_count_df <- read.csv("tmp/line_count.csv", header = FALSE, col.names = "LineCount")

# Combine into one table, one row per document type
counts <- cbind(word_count_df, char_count_df, line_count_df)
counts$DocumentType <- c("blogs", "news", "twitter")
long_counts <- gather(counts, key = "Variable", value = "Count", -DocumentType)
ggplot(long_counts, aes(x = DocumentType, y = Count, fill = Variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ Variable, scales = "free_y") +
  labs(title = "Counts for Document Type", y = "Count") +
  theme(legend.position = "none")
corpus <- VCorpus(DirSource("data/final/en_US/", encoding = "UTF-8"),
                  readerControl = list(reader = readPlain))
summary(corpus)
## Length Class Mode
## en_US.blogs.txt 2 PlainTextDocument list
## en_US.news.txt 2 PlainTextDocument list
## en_US.twitter.txt 2 PlainTextDocument list
This loads all the text into memory, about 872 MB. That is manageable for now, but we should be careful not to needlessly duplicate or copy the data.
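As a sanity check on that figure, the in-memory size of the corpus object created above can be inspected directly; a quick sketch:

# Report the size of the corpus object in megabytes
print(object.size(corpus), units = "MB")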
# Basic cleaning: lower-case, then strip punctuation, numbers and extra whitespace
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
I am choosing not to remove stop words, because the end goal is to predict the next word a user wants to type, and that next word will often be a stop word.
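For illustration, a minimal sketch of what would have been discarded: tm's built-in English stop-word list consists of exactly the kind of common words a keyboard should be able to predict.

# Peek at tm's English stop-word list: pronouns, articles, prepositions, etc.
head(stopwords("en"), 15)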
We can see that the twitter document has by far the most lines, but fewer words than the other datasets; intuitively this is because of the strict character limit on tweets.
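As a quick check of that intuition, the average number of words per line can be computed from the counts table built above (a small sketch using the counts data frame from the earlier chunk):

# Average words per line for each document type; tweets should be much shorter
counts$WordsPerLine <- counts$WordCount / counts$LineCount
counts[, c("DocumentType", "WordsPerLine")]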
Create a term-document matrix showing the frequency of each word in each document. Stop words are kept, consistent with the choice above. Stemming (so that "cats" and "cat" would be counted as the same term) is not applied here either, although it could be; see the sketch after the frequency plot.
# Term-document matrix: rows are terms, columns are the three documents
tdm <- TermDocumentMatrix(corpus)

# Total frequency of each word across all documents, sorted descending
word_freq <- rowSums(as.matrix(tdm))
word_freq <- data.frame(word = names(word_freq), freq = word_freq)
word_freq <- word_freq[order(word_freq$freq, decreasing = TRUE), ]
ggplot(head(word_freq, 25), aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity")
I will first tokenize the text into sentences, then randomly sample from these sentences to form training, validation, and test sets.
# tokenize_sentences() comes from the tokenizers package
corpus_sentences <- tm_map(corpus, content_transformer(tokenize_sentences))
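The sampling step itself is not shown yet; below is a minimal sketch of how the split could be done, assuming the tokenized sentences are flattened into a single character vector and an 80/10/10 split is used (the object names and proportions are illustrative, not final):

# Flatten the tokenized documents into one vector of sentences
all_sentences <- unlist(lapply(corpus_sentences, content))

# Reproducibly assign each sentence to train / validation / test
set.seed(1234)
split <- sample(c("train", "validation", "test"),
                size = length(all_sentences),
                replace = TRUE,
                prob = c(0.8, 0.1, 0.1))

train_set      <- all_sentences[split == "train"]
validation_set <- all_sentences[split == "validation"]
test_set       <- all_sentences[split == "test"]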