Synopsis

This is a milestone report for the Coursera Data Science Capstone course. The goal of this report is to demonstrate that I have become comfortable working with the data and that I am on track to create the prediction algorithm. I downloaded, loaded, sampled, and tidied the data for analysis, then ran some basic n-gram analysis.

Downloading the data and loading it in

Using the file() and readLines() functions, I loaded all three US-English texts. The three objects correspond to the data sources: twit for Twitter, news for news sources, and blog for blog sources. The skipNul argument was important because a number of lines contain embedded nul characters that would otherwise trip up readLines().

# Download and unzip dataset
if (!file.exists("Coursera-SwiftKey.zip")) {
  fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(fileUrl, destfile = "./Coursera-SwiftKey.zip")
}
if (!file.exists("final")) { unzip("Coursera-SwiftKey.zip") }

# Load data

con <- file("final/en_US/en_US.twitter.txt", "rb")
twit <- readLines(con, skipNul=TRUE, encoding="UTF-8")
close(con)

con <- file("final/en_US/en_US.news.txt", "rb")
news <- readLines(con, skipNul=TRUE, encoding="UTF-8")
close(con)

con <- file("final/en_US/en_US.blogs.txt", "rb")
blog <- readLines(con, skipNul=TRUE, encoding="UTF-8")
close(con)
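
To get a quick sense of scale before any processing, the on-disk size of each raw file can be checked directly. This is an optional step; the paths match those used above.

# Optional check: size of each raw file in megabytes
round(file.size(c("final/en_US/en_US.twitter.txt",
                  "final/en_US/en_US.news.txt",
                  "final/en_US/en_US.blogs.txt")) / 1024^2, 1)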

Exploratory Data Analysis

Basic summary statistics about the data sets

I manually created a table summarizing the three text files. Two key functions came from the tokenizers package: count_words() and count_characters(). I also used the tokenize_fastestword() function from the quanteda package to count unique words. Across all three files, there are roughly 4.27 million lines and 102 million words (the exact totals are computed from the table below). Perhaps not surprisingly, the longest Twitter line is 140 characters, and Twitter has by far the shortest average line length, at roughly 69 characters. This was a rather heavy calculation and took several minutes to run, which does not bode well for more complicated functions later on.

library(tokenizers)
library(quanteda)

dataSum <- data.frame(files = c("Twitter", "News", "Blogs"),
                      numLines = c(length(twit),
                                   length(news),
                                   length(blog)),
                      numWords = c(sum(count_words(twit)),
                                   sum(count_words(news)),
                                   sum(count_words(blog))),
                      longestLine = c(max(count_characters(twit)),
                                      max(count_characters(news)),
                                      max(count_characters(blog))),
                      meanLineLength = c(mean(count_characters(twit)),
                                         mean(count_characters(news)),
                                         mean(count_characters(blog))),
                      uniqueWords = c(length(unique(unlist(tokenize_fastestword(twit)))),
                                      length(unique(unlist(tokenize_fastestword(news)))),
                                      length(unique(unlist(tokenize_fastestword(blog))))))

dataSum
##     files numLines numWords longestLine meanLineLength uniqueWords
## 1 Twitter  2360148 30093413         140       68.68043     1290173
## 2    News  1010242 34762395       11384      201.16284      876770
## 3   Blogs   899288 37546239       40833      229.98668     1103503
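
The totals quoted above come straight from this table; summing the columns of dataSum gives the exact figures.

# Total lines and words across all three files
sum(dataSum$numLines) # 4,269,678 lines
sum(dataSum$numWords) # 102,402,047 words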

At hundreds of megabytes, the full files would be too large for the rest of the exercise, so I sampled from them and combined the samples into a single corpus to make analysis easier. That still leaves 60,000 total lines, which I hypothesize will be sufficient. This chunk uses the quanteda package's corpus() function to create a corpus object.

library(quanteda)

# Sample data and recombine as one corpus

set.seed(1984)
twitSample <- sample(twit, size=20000)
newsSample <- sample(news, size=20000)
blogSample <- sample(blog, size=20000)

totalSample <- c(twitSample, newsSample, blogSample)
writeLines(totalSample, "./totalSample.txt") # Save the combined sample to disk
textcon <- file("totalSample.txt")
corpus <- readLines(textcon)                 # Read it back in as a character vector
close(textcon)
rm(textcon)

# Turn the combined sample into a corpus

corpus <- corpus(corpus) # Convert the character vector into a quanteda corpus object
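
As a quick check that the sampling and conversion worked, quanteda's ndoc() and summary() can confirm the number of documents (one per sampled line) and show per-document token counts. This is just an optional inspection step.

# Optional check: number of documents and a peek at the first few
ndoc(corpus) # should be 60000
summary(corpus, n = 3) # types, tokens, and sentences for the first three documents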

Then, as requested by the assignment, I filtered out profanity. I manually entered four profane words as a case-insensitive regular expression and used quanteda's corpus_trim() function to exclude every sentence matching that pattern. I excluded entire sentences rather than individual words, because dropping single words could produce odd n-grams.

I created a table to see the effects of removing lines with profanity. It only removed 912 sentences, equivalent to <1% of the total set.

fourWords <- "[Ss][Hh][Ii][Tt]|[Pp][Ii][Ss][Ss]|[Ff][Uu][Cc][Kk]|[Cc][Uu][Nn][Tt]"
corpusClean <- corpus_trim(corpus, exclude_pattern = fourWords, what = "sentences")

profanityTable <- data.frame(WithProfanity = c(sum(count_sentences(corpus))),
                             noProfanity = c(sum(count_sentences(corpusClean))),
                             sentencesRemoved = c(sum(count_sentences(corpus)) - sum(count_sentences(corpusClean))),
                             percReduction = c(1 - sum(count_sentences(corpusClean)) / sum(count_sentences(corpus))))

profanityTable
##   WithProfanity noProfanity sentencesRemoved percReduction
## 1        124492      123580              912   0.007325772

Finally, I used the tokens() function from the quanteda package to remove punctuation, numbers, symbols, and URLs, all of which would get in the way of the analysis.

tokensClean <- tokens(corpusClean, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE, remove_url = TRUE)
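
To illustrate what this step removes, the same document can be compared before and after cleaning. This is an optional spot check (document 1 is an arbitrary choice).

# Optional spot check: tokens of the first document, before and after cleaning
as.list(tokens(corpusClean))[[1]] # includes punctuation, numbers, symbols, and URLs
as.list(tokensClean)[[1]]         # cleaned tokens only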

N-gram Data Analysis

Now comfortable with the dataset, I did some more visual exploration. I plotted a histogram of word frequency, excluding stopwords like “is” and “the,” which would otherwise dominate the counts. The histogram demonstrates the extreme concentration of word frequency: a very small number of words appear thousands of times, whereas tens of thousands of words appear only a handful of times.

This analysis relies on additional packages. I use quanteda functions for all of the NLP steps (tokens_remove(), dfm(), textstat_frequency(), etc.) and ggplot2 (loaded via the tidyverse) for plotting.

library(tidyverse)
tokensClean %>%
  tokens_remove(pattern = stopwords('en')) %>%
  dfm(tolower = TRUE) %>%
  textstat_frequency() %>%
  ggplot(aes(x = frequency)) +
  geom_histogram(binwidth = 1, fill = "red") +
  labs(x = "Number of Appearances of Word", y = "Number of Words (log scale)") +
  scale_y_log10()

I was curious to see what the most frequent words (excl. stopwords) were, so I plotted the frequency of the top 20. The word ‘said’ was the runaway winner, with nearly 6,000 appearances.

tokensClean %>%
  tokens_remove(pattern = stopwords('en')) %>%
  dfm(tolower = TRUE) %>%
  textstat_frequency(n = 20) %>%
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() + 
  coord_flip() +
  labs(x = "Number of word appearances", y = "Frequency")

Next, I looked at the most frequent bigrams. I again excluded stopwords, though I plan to leave them in for the final model. Not surprisingly, the frequencies fell dramatically, with the top bigram (“last year”) appearing a little over 350 times. As the n in n-gram increases, the frequencies fall.

tokensClean %>%
  tokens_remove(pattern = stopwords('en')) %>%
  tokens_ngrams(n = 2) %>%
  dfm(tolower = TRUE) %>%
  textstat_frequency(n = 20) %>%
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() + 
  coord_flip() +
  labs(x = "Number of Appearances of N-gram", y = "Frequency")

Finally, I checked how many unique words would be needed to cover 50% and 90% of all word instances; this time stopwords are included. This gives me a sense of how small I could make my model without losing too much accuracy. The plot and table below show that a few words account for a large share of total word instances: the top 10 words account for over 20%, and a mere 150 words account for over 50%. The marginal coverage diminishes rapidly, though. At 1,000 words we reach 70% coverage, which is more than six times as many words for only about 20 percentage points more coverage, and surpassing 80% coverage takes 3,000 words.

uniqueNeeds <- function(dictLength) {
  # Full frequency table for all words, sorted by descending frequency
  den <- tokensClean %>%
    dfm(tolower = TRUE) %>%
    textstat_frequency()
  
  # The top dictLength rows of that same table
  num <- head(den, dictLength)
  
  data.frame(dictLength = dictLength,
             totalCover = sum(num$frequency) / sum(den$frequency), # Share of total word count covered by the top N words
             uniqueCover = dictLength / length(den$feature)) # Share of unique words used by the top N
}

wordCoverTable <- rbind(uniqueNeeds(10), uniqueNeeds(50), uniqueNeeds(100),
                        uniqueNeeds(150), uniqueNeeds(250), uniqueNeeds(500),
                        uniqueNeeds(1000), uniqueNeeds(2000), uniqueNeeds(3000))

wordCoverPlot <- ggplot(data = wordCoverTable, aes(x = dictLength, y = totalCover)) +
  geom_line(color = "#0072B2") +
  geom_line(aes(y = uniqueCover), color = "#CC79A7") +
  labs(title = "Word Coverage",
       x = "Top N Words",
       y = "Share of Total Words (blue) and of Unique Words (pink) Covered")

wordCoverTable
##   dictLength totalCover  uniqueCover
## 1         10  0.2161577 0.0001263472
## 2         50  0.3815552 0.0006317359
## 3        100  0.4622369 0.0012634718
## 4        150  0.5068697 0.0018952077
## 5        250  0.5567451 0.0031586794
## 6        500  0.6265735 0.0063173588
## 7       1000  0.7001423 0.0126347177
## 8       2000  0.7739532 0.0252694354
## 9       3000  0.8148260 0.0379041530
wordCoverPlot
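
The table only brackets the 50% target and stops short of 90%. For the exact dictionary size at any coverage level, a cumulative sum over the sorted word counts suffices. The sketch below is illustrative rather than part of the analysis above; coverageNeeded() is a hypothetical helper and assumes the tokensClean object.

# Illustrative helper: smallest number of top-frequency words whose cumulative
# count reaches the target share of all word instances
coverageNeeded <- function(target) {
  wordCounts <- sort(colSums(dfm(tokensClean, tolower = TRUE)), decreasing = TRUE)
  cumCover <- cumsum(wordCounts) / sum(wordCounts)
  unname(which(cumCover >= target)[1])
}

coverageNeeded(0.5) # words needed for 50% coverage
coverageNeeded(0.9) # words needed for 90% coverage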

Next steps

Now that I have a better feel for the data and how to manipulate it, I need to create the predictive model and Shiny app. Quanteda appears to be the workhorse for this model. The challenge will be to balance accuracy against size and speed, because file sizes can quickly become enormous and calculation times lengthy, neither of which is good for a web app.
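
To make that next step concrete, here is a minimal sketch of the kind of frequency-based lookup the model could start from. It assumes the tokensClean object created above; bigramFreq and predictNext() are illustrative names rather than the final design, and the real model will need to handle histories it has never seen and be pruned aggressively for size.

# Minimal sketch (illustrative only): suggest the next word from bigram counts
library(quanteda)
library(tidyverse)

# Count bigrams in the cleaned sample and split each into (history, nextWord)
bigramCounts <- tokensClean %>%
  tokens_ngrams(n = 2, concatenator = " ") %>%
  dfm(tolower = TRUE) %>%
  colSums()

bigramFreq <- tibble(bigram = names(bigramCounts), count = as.numeric(bigramCounts)) %>%
  separate(bigram, into = c("history", "nextWord"), sep = " ")

# Hypothetical helper: the n words most often observed after a given word
predictNext <- function(word, n = 3) {
  bigramFreq %>%
    filter(history == tolower(word)) %>%
    arrange(desc(count)) %>%
    head(n) %>%
    pull(nextWord)
}

predictNext("last") # e.g. candidate words to follow "last"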