This is a milestone report for the Coursera Data Science Capstone course. The goal of this report is to demonstrate that I have become comfortable working with the data and that I am on track to create the prediction algorithm. I downloaded, loaded, sampled, and tidied the data for analysis, then ran some basic n-gram analysis.
Using file connections and the readLines() function, I loaded all three US-English texts. The three objects correspond to the data sources: twit for Twitter, news for news articles, and blog for blogs. The skipNul argument was important because a number of lines contain embedded nul characters that would otherwise break the read.
# Download and unzip dataset
if (!file.exists("Coursera-SwiftKey.zip")) {
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileUrl, destfile = "./Coursera-SwiftKey.zip") }
if (!file.exists("final")) { unzip("Coursera-SwiftKey.zip") }
# Load data (binary mode plus skipNul handles embedded nul/control characters in the files)
con <- file("final/en_US/en_US.twitter.txt", "rb")
twit <- readLines(con, skipNul=TRUE, encoding="UTF-8")
close(con)
con <- file("final/en_US/en_US.news.txt", "rb")
news <- readLines(con, skipNul=TRUE, encoding="UTF-8")
close(con)
con <- file("final/en_US/en_US.blogs.txt", "rb")
blog <- readLines(con, skipNul=TRUE, encoding="UTF-8")
close(con)
I manually created a table summarizing the three text files. Two key functions came from the tokenizers package: count_words() and count_characters(). I also used the tokenize_fastestword() function from the quanteda package. Across all three files, there are 4,269,678 lines and 102,402,047 words. Perhaps not surprisingly, the longest line for Twitter is 140 characters, and Twitter had by far the shortest average line length, at roughly 69 characters. This was a rather heavy calculation and took several minutes to run, which doesn’t bode well for more complicated functions later on.
library(tokenizers)
library(quanteda)
dataSum <- data.frame(files = c("Twitter", "News", "Blogs"),
numLines = c(length(twit),
length(news),
length(blog)),
numWords = c(sum(count_words(twit)),
sum(count_words(news)),
sum(count_words(blog))),
longestLine = c(max(count_characters(twit)),
max(count_characters(news)),
max(count_characters(blog))),
meanLineLength = c(mean(count_characters(twit)),
mean(count_characters(news)),
mean(count_characters(blog))),
uniqueWords = c(length(unique(unlist(tokenize_fastestword(twit)))),
length(unique(unlist(tokenize_fastestword(news)))),
length(unique(unlist(tokenize_fastestword(blog))))))
dataSum
## files numLines numWords longestLine meanLineLength uniqueWords
## 1 Twitter 2360148 30093413 140 68.68043 1290173
## 2 News 1010242 34762395 11384 201.16284 876770
## 3 Blogs 899288 37546239 40833 229.98668 1103503
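Before sampling, it is worth checking how much memory the three text vectors occupy. The quick check below is only a sketch; the exact sizes will vary by system and encoding.
# Quick sanity check of the in-memory size of each text vector (values vary by system)
format(object.size(twit), units = "MB")
format(object.size(news), units = "MB")
format(object.size(blog), units = "MB")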
At hundreds of megabytes, the full files would be too large for the rest of the exercise, so I sampled from them and combined the samples into a single corpus to make analysis easier. The combined sample still has 60,000 lines, which I hypothesize will be sufficient. This chunk uses the quanteda package, which provides the corpus() function for creating a corpus object.
library(quanteda)
# Sample data and recombine as one corpus
set.seed(1984)
twitSample <- sample(twit, size=20000)
newsSample <- sample(news, size=20000)
blogSample <- sample(blog, size=20000)
totalSample <- c(twitSample, newsSample, blogSample)
writeLines(totalSample, "./totalSample.txt")
textcon <- file("totalSample.txt")
corpus <- readLines(textcon)
close(textcon)
rm(textcon)
# Turn the sampled text into a quanteda corpus object
corpus <- corpus(corpus)
Then, as requested by the assignment, I filtered out profanity. I specified four profane words as a single regular expression, then used quanteda’s corpus_trim() function to exclude every sentence matching that pattern. I excluded entire sentences rather than individual words, because removing words mid-sentence could produce odd n-grams.
I created a table to see the effect of removing sentences containing profanity. It removed only 912 sentences, less than 1% of the total set.
fourWords <- "[Ss][Hh][Ii][Tt]|[Pp][Ii][Ss][Ss]|[Ff][Uu][Cc][Kk]|[Cc][Uu][Nn][Tt]"
corpusClean <- corpus_trim(corpus, exclude_pattern = fourWords, what = "sentences")
profanityTable <- data.frame(WithProfanity = sum(count_sentences(corpus)),
noProfanity = sum(count_sentences(corpusClean)),
sentencesRemoved = sum(count_sentences(corpus)) - sum(count_sentences(corpusClean)),
percReduction = 1 - sum(count_sentences(corpusClean)) / sum(count_sentences(corpus)))
profanityTable
## WithProfanity noProfanity sentencesRemoved percReduction
## 1 124492 123580 912 0.007325772
Finally, I used the tokens() function from the quanteda package to remove punctuation, numbers, symbols, and URLs, since these characters would get in the way of the analysis.
tokensClean <- tokens(corpusClean, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE, remove_url = TRUE)
Comfortable with the dataset, I moved on to some visual exploration. I plotted a histogram of word frequency, excluding stopwords like “is” and “the,” which would otherwise dominate. The histogram demonstrates the extreme concentration of word frequency: a very small number of words appear thousands of times, whereas tens of thousands of words appear only a handful of times.
This analysis relies on more packages. I use quanteda functions for all of the NLP (tokens_remove(), dfm(), textstat_frequency(), etc.) and ggplot2 (loaded via the tidyverse) for plotting.
library(tidyverse)
tokensClean %>%
tokens_remove(pattern = stopwords('en')) %>%
dfm(tolower = TRUE) %>%
textstat_frequency() %>%
ggplot(aes(x = frequency)) +
geom_histogram(binwidth = 1, fill = "red") +
theme(legend.position = "none") +
labs(x = "Number of Appearances of Word", y = "Frequency") +
scale_y_log10()
I was curious to see what the most frequent words (excl. stopwords) were, so I plotted the frequency of the top 20. The word ‘said’ was the runaway winner, with nearly 6,000 appearances.
tokensClean %>%
tokens_remove(pattern = stopwords('en')) %>%
dfm(tolower = TRUE) %>%
textstat_frequency(n = 20) %>%
ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
geom_point() +
coord_flip() +
labs(x = "Number of word appearances", y = "Frequency")
Next I looked at the most frequent bigrams. I again excluded stopwords, though I plan to leave stopwords in the final model. Not surprisingly, the frequencies fell dramatically, with the top bigram (“last year”) appearing a little over 350 times. As the n in an n-gram increases, the frequencies fall.
tokensClean %>%
tokens_remove(pattern = stopwords('en')) %>%
tokens_ngrams(n = 2) %>%
dfm(tolower = TRUE) %>%
textstat_frequency(n = 20) %>%
ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
geom_point() +
coord_flip() +
labs(x = "Number of Appearances of N-gram", y = "Frequency")
Finally, I checked how many unique words would be needed to cover 50% and 90% of all word instances (stopwords included this time). This gives me a sense of how small I could make my model without losing too much accuracy. The plot and table show that a few words account for a large percentage of total word instances: the top 10 words cover over 20%, and a mere 150 words cover over 50%. The marginal coverage diminishes rapidly, though. At 1,000 words we reach 70% coverage, which is 6.6 times as many words for only 20 percentage points more coverage, and surpassing 80% coverage requires 3,000 words.
# Compute the full frequency table once, since the dfm step is expensive
freqTable <- tokensClean %>%
dfm(tolower = TRUE) %>%
textstat_frequency()
uniqueNeeds <- function(dictLength) {
topWords <- head(freqTable, dictLength) # Top N words by frequency
data.frame(dictLength = dictLength,
totalCover = sum(topWords$frequency) / sum(freqTable$frequency), # Percentage of total word count covered by top N words
uniqueCover = dictLength / nrow(freqTable)) # Percentage of unique words covered by top N words
}
wordCoverTable <- rbind(uniqueNeeds(10), uniqueNeeds(50), uniqueNeeds(100),
uniqueNeeds(150), uniqueNeeds(250), uniqueNeeds(500),
uniqueNeeds(1000), uniqueNeeds(2000), uniqueNeeds(3000))
wordCoverPlot <- ggplot(data = wordCoverTable, aes(x = dictLength, y = totalCover)) +
geom_line(color = "#0072B2") +
geom_line(aes(y = uniqueCover), color = "#CC79A7") +
labs(title = "Word Coverage",
x = "Top N Words",
y = "% Uniques and % Total Covered by Top N Words") +
theme(legend.position="none")
wordCoverTable
## dictLength totalCover uniqueCover
## 1 10 0.2161577 0.0001263472
## 2 50 0.3815552 0.0006317359
## 3 100 0.4622369 0.0012634718
## 4 150 0.5068697 0.0018952077
## 5 250 0.5567451 0.0031586794
## 6 500 0.6265735 0.0063173588
## 7 1000 0.7001423 0.0126347177
## 8 2000 0.7739532 0.0252694354
## 9 3000 0.8148260 0.0379041530
wordCoverPlot
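The table only samples fixed dictionary sizes, so as a complement one can scan the cumulative frequencies directly. The helper below is a minimal sketch (coverageNeeded is my placeholder name, not part of the assignment); it reuses the freqTable computed above and returns the smallest number of top-ranked words whose cumulative count reaches a target share of all word instances.
# Minimal sketch: smallest dictionary size reaching a target coverage of all word instances
coverageNeeded <- function(target) {
cumulativeShare <- cumsum(freqTable$frequency) / sum(freqTable$frequency)
which(cumulativeShare >= target)[1] # first rank at which coverage reaches the target
}
coverageNeeded(0.5) # unique words needed for 50% coverage
coverageNeeded(0.9) # unique words needed for 90% coverage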
Now that I have a better feel for the data and how to manipulate it, the next step is to build the predictive model and Shiny app. The quanteda package appears to be the workhorse for this model. The challenge will be to balance accuracy against size and speed, because the file sizes can quickly become enormous and the calculation times lengthy, two attributes that aren’t good for a web app.
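As a rough illustration of the direction I plan to take, a simple frequency-based bigram lookup might work as sketched below. The names predictNext and bigramFreq are placeholders rather than the final design, and a real model will need higher-order n-grams, backoff, and aggressive pruning to stay small and fast.
# Sketch of a frequency-based next-word lookup built from bigram counts
# (predictNext and bigramFreq are placeholder names)
bigramFreq <- tokensClean %>%
tokens_ngrams(n = 2) %>%
dfm() %>%
textstat_frequency()
predictNext <- function(word, n = 3) {
# Bigram features look like "first_second"; keep those starting with the (lowercase) input word
matches <- bigramFreq[grepl(paste0("^", word, "_"), bigramFreq$feature), ]
# textstat_frequency() output is already sorted by frequency, so the top rows are the best candidates
head(sub(paste0("^", word, "_"), "", matches$feature), n)
}
predictNext("last") # suggest the most likely words to follow "last"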