Synopsis

This is a milestone report for the Coursera Data Science Capstone course. The goal of this report is to demonstrate that I have become comfortable working with the data and that I am on track to create the prediction algorithm. I downloaded, loaded, sampled, and tidied the data for analysis, then ran some basic n-gram analysis.

Downloading the data and loading it in

Using the file() and readLines() functions, I loaded all three US-English texts. The three objects correspond to the data sources: twit for Twitter, news for news sources, and blog for blog sources. The skipNul argument was important because a number of lines contain embedded nul characters that would otherwise trip up readLines().

# Download and unzip dataset
if (!file.exists("Coursera-SwiftKey.zip")) {
  fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(fileUrl, destfile = "./Coursera-SwiftKey.zip")
}
if (!file.exists("final")) { unzip("Coursera-SwiftKey.zip") }

# Load data

con <- file("final/en_US/en_US.twitter.txt", "rb")
twit <- readLines(con, skipNul=TRUE, encoding="UTF-8")
close(con)

con <- file("final/en_US/en_US.news.txt", "rb")
news <- readLines(con, skipNul=TRUE, encoding="UTF-8")
close(con)

con <- file("final/en_US/en_US.blogs.txt", "rb")
blog <- readLines(con, skipNul=TRUE, encoding="UTF-8")
close(con)
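
To get a quick sense of scale before any processing, the on-disk size of each raw file can be checked directly. This is an optional step; the paths match those used above.

# Optional check: size of each raw file in megabytes
round(file.size(c("final/en_US/en_US.twitter.txt",
                  "final/en_US/en_US.news.txt",
                  "final/en_US/en_US.blogs.txt")) / 1024^2, 1)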

Exploratory Data Analysis

Basic summary statistics about the data sets

I manually created a table summarizing the three text files. Two key functions came from the tokenizers package: count_words() and count_characters(). I also used the tokenize_fastestword() function from the quanteda package to count unique words. Across all three files, there are roughly 4.27 million lines and 102 million words (the exact totals are computed from the table below). Perhaps not surprisingly, the longest Twitter line is 140 characters, and Twitter has by far the shortest average line length, at roughly 69 characters. This was a rather heavy calculation and took several minutes to run, which does not bode well for more complicated functions later on.

library(tokenizers)
library(quanteda)

dataSum <- data.frame(files = c("Twitter", "News", "Blogs"),
                      numLines = c(length(twit),
                                   length(news),
                                   length(blog)),
                      numWords = c(sum(count_words(twit)),
                                   sum(count_words(news)),
                                   sum(count_words(blog))),
                      longestLine = c(max(count_characters(twit)),
                                      max(count_characters(news)),
                                      max(count_characters(blog))),
                      meanLineLength = c(mean(count_characters(twit)),
                                         mean(count_characters(news)),
                                         mean(count_characters(blog))),
                      uniqueWords = c(length(unique(unlist(tokenize_fastestword(twit)))),
                                      length(unique(unlist(tokenize_fastestword(news)))),
                                      length(unique(unlist(tokenize_fastestword(blog))))))

dataSum
##     files numLines numWords longestLine meanLineLength uniqueWords
## 1 Twitter  2360148 30093413         140       68.68043     1290173
## 2    News  1010242 34762395       11384      201.16284      876770
## 3   Blogs   899288 37546239       40833      229.98668     1103503
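
The totals quoted above come straight from this table; summing the columns of dataSum gives the exact figures.

# Total lines and words across all three files
sum(dataSum$numLines) # 4,269,678 lines
sum(dataSum$numWords) # 102,402,047 words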

At hundreds of megabytes, the full files would be too large for the rest of the exercise, so I sampled from them and combined the samples into a single corpus to make analysis easier. That still leaves 60,000 total lines, which I hypothesize will be sufficient. This chunk uses the quanteda package's corpus() function to create a corpus object.

library(quanteda)

# Sample data and recombine as one corpus

set.seed(1984)
twitSample <- sample(twit, size=20000)
newsSample <- sample(news, size=20000)
blogSample <- sample(blog, size=20000)

totalSample <- c(twitSample, newsSample, blogSample)
writeLines(totalSample, "./totalSample.txt") # Save the combined sample to disk
textcon <- file("totalSample.txt")
corpus <- readLines(textcon)                 # Read it back in as a character vector
close(textcon)
rm(textcon)

# Turn the combined sample into a corpus

corpus <- corpus(corpus) # Convert the character vector into a quanteda corpus object
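
As a quick check that the sampling and conversion worked, quanteda's ndoc() and summary() can confirm the number of documents (one per sampled line) and show per-document token counts. This is just an optional inspection step.

# Optional check: number of documents and a peek at the first few
ndoc(corpus) # should be 60000
summary(corpus, n = 3) # types, tokens, and sentences for the first three documents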

Then, as requested by the assignment, I filtered out profanity. I manually entered four profane words as a case-insensitive regular expression and used quanteda's corpus_trim() function to exclude every sentence matching that pattern. I excluded entire sentences rather than individual words, because dropping single words could produce odd n-grams.

I created a table to see the effects of removing lines with profanity. It only removed 912 sentences, equivalent to <1% of the total set.

fourWords <- "[Ss][Hh][Ii][Tt]|[Pp][Ii][Ss][Ss]|[Ff][Uu][Cc][Kk]|[Cc][Uu][Nn][Tt]"
corpusClean <- corpus_trim(corpus, exclude_pattern = fourWords, what = "sentences")

profanityTable <- data.frame(WithProfanity = c(sum(count_sentences(corpus))),
                             noProfanity = c(sum(count_sentences(corpusClean))),
                             sentencesRemoved = c(sum(count_sentences(corpus)) - sum(count_sentences(corpusClean))),
                             percReduction = c(1 - sum(count_sentences(corpusClean)) / sum(count_sentences(corpus))))

profanityTable
##   WithProfanity noProfanity sentencesRemoved percReduction
## 1        124492      123580              912   0.007325772

Finally, I used the tokens() function from the quanteda package to remove punctuation, numbers, symbols, and URLs, all of which would get in the way of the analysis.

tokensClean <- tokens(corpusClean, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE, remove_url = TRUE)
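
To illustrate what this step removes, the same document can be compared before and after cleaning. This is an optional spot check (document 1 is an arbitrary choice).

# Optional spot check: tokens of the first document, before and after cleaning
as.list(tokens(corpusClean))[[1]] # includes punctuation, numbers, symbols, and URLs
as.list(tokensClean)[[1]]         # cleaned tokens only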

N-gram Data Analysis

Now comfortable with the dataset, I did some more visual exploration. I plotted a histogram of word frequency, excluding stopwords like “is” and “the,” which would otherwise dominate the counts. The histogram demonstrates the extreme concentration of word frequency: a very small number of words appear thousands of times, whereas tens of thousands of words appear only a handful of times.

This analysis relies on additional packages. I use quanteda functions for all of the NLP steps (tokens_remove(), dfm(), textstat_frequency(), etc.) and ggplot2 (loaded via the tidyverse) for plotting.

library(tidyverse)
tokensClean %>%
  tokens_remove(pattern = stopwords('en')) %>%
  dfm(tolower = TRUE) %>%
  textstat_frequency() %>%
  ggplot(aes(x = frequency)) +
  geom_histogram(binwidth = 1, fill = "red") +
  labs(x = "Number of Appearances of Word", y = "Number of Words (log scale)") +
  scale_y_log10()

I was curious to see what the most frequent words (excl. stopwords) were, so I plotted the frequency of the top 20. The word ‘said’ was the runaway winner, with nearly 6,000 appearances.

tokensClean %>%
  tokens_remove(pattern = stopwords('en')) %>%
  dfm(tolower = TRUE) %>%
  textstat_frequency(n = 20) %>%
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() + 
  coord_flip() +
  labs(x = "Number of word appearances", y = "Frequency")

Next, I looked at the most frequent bigrams. I again excluded stopwords, though I plan to leave them in for the final model. Not surprisingly, the frequencies fell dramatically, with the top bigram (“last year”) appearing a little over 350 times. As the n in n-gram increases, the frequencies fall.

tokensClean %>%
  tokens_remove(pattern = stopwords('en')) %>%
  tokens_ngrams(n = 2) %>%
  dfm(tolower = TRUE) %>%
  textstat_frequency(n = 20) %>%
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() + 
  coord_flip() +
  labs(x = "Number of Appearances of N-gram", y = "Frequency")

Finally, I checked how many unique words would be needed to cover 50% and 90% of all word instances; this time stopwords are included. This gives me a sense of how small I could make my model without losing too much accuracy. The plot and table below show that a few words account for a large share of total word instances: the top 10 words account for over 20%, and a mere 150 words account for over 50%. The marginal coverage diminishes rapidly, though. At 1,000 words we reach 70% coverage, which is more than six times as many words for only about 20 percentage points more coverage, and surpassing 80% coverage takes 3,000 words.

uniqueNeeds <- function(dictLength) {
  # Full frequency table for all words, sorted by descending frequency
  den <- tokensClean %>%
    dfm(tolower = TRUE) %>%
    textstat_frequency()
  
  # The top dictLength rows of that same table
  num <- head(den, dictLength)
  
  data.frame(dictLength = dictLength,
             totalCover = sum(num$frequency) / sum(den$frequency), # Share of total word count covered by the top N words
             uniqueCover = dictLength / length(den$feature)) # Share of unique words used by the top N
}

wordCoverTable <- rbind(uniqueNeeds(10), uniqueNeeds(50), uniqueNeeds(100),
                        uniqueNeeds(150), uniqueNeeds(250), uniqueNeeds(500),
                        uniqueNeeds(1000), uniqueNeeds(2000), uniqueNeeds(3000))

wordCoverPlot <- ggplot(data = wordCoverTable, aes(x = dictLength, y = totalCover)) +
  geom_line(color = "#0072B2") +
  geom_line(aes(y = uniqueCover), color = "#CC79A7") +
  labs(title = "Word Coverage",
       x = "Top N Words",
       y = "Share of Total Words (blue) and of Unique Words (pink) Covered")

wordCoverTable
##   dictLength totalCover  uniqueCover
## 1         10  0.2161577 0.0001263472
## 2         50  0.3815552 0.0006317359
## 3        100  0.4622369 0.0012634718
## 4        150  0.5068697 0.0018952077
## 5        250  0.5567451 0.0031586794
## 6        500  0.6265735 0.0063173588
## 7       1000  0.7001423 0.0126347177
## 8       2000  0.7739532 0.0252694354
## 9       3000  0.8148260 0.0379041530
wordCoverPlot
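
The table only brackets the 50% target and stops short of 90%. For the exact dictionary size at any coverage level, a cumulative sum over the sorted word counts suffices. The sketch below is illustrative rather than part of the analysis above; coverageNeeded() is a hypothetical helper and assumes the tokensClean object.

# Illustrative helper: smallest number of top-frequency words whose cumulative
# count reaches the target share of all word instances
coverageNeeded <- function(target) {
  wordCounts <- sort(colSums(dfm(tokensClean, tolower = TRUE)), decreasing = TRUE)
  cumCover <- cumsum(wordCounts) / sum(wordCounts)
  unname(which(cumCover >= target)[1])
}

coverageNeeded(0.5) # words needed for 50% coverage
coverageNeeded(0.9) # words needed for 90% coverage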

Next steps

Now that I have a better feel for the data and how to manipulate it, I need to create the predictive model and Shiny app. Quanteda appears to be the workhorse for this model. The challenge will be to balance accuracy against size and speed, because file sizes can quickly become enormous and calculation times lengthy, neither of which is good for a web app.
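
To make that next step concrete, here is a minimal sketch of the kind of frequency-based lookup the model could start from. It assumes the tokensClean object created above; bigramFreq and predictNext() are illustrative names rather than the final design, and the real model will need to handle histories it has never seen and be pruned aggressively for size.

# Minimal sketch (illustrative only): suggest the next word from bigram counts
library(quanteda)
library(tidyverse)

# Count bigrams in the cleaned sample and split each into (history, nextWord)
bigramCounts <- tokensClean %>%
  tokens_ngrams(n = 2, concatenator = " ") %>%
  dfm(tolower = TRUE) %>%
  colSums()

bigramFreq <- tibble(bigram = names(bigramCounts), count = as.numeric(bigramCounts)) %>%
  separate(bigram, into = c("history", "nextWord"), sep = " ")

# Hypothetical helper: the n words most often observed after a given word
predictNext <- function(word, n = 3) {
  bigramFreq %>%
    filter(history == tolower(word)) %>%
    arrange(desc(count)) %>%
    head(n) %>%
    pull(nextWord)
}

predictNext("last") # e.g. candidate words to follow "last"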