Capstone Dataset

The training data (Coursera-SwiftKey.zip) was downloaded manually from the URL specified in the course reading materials (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip). The working directory was then set, and the training data was unzipped into it.
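
For reproducibility, the download could also be scripted; a minimal sketch (to be run after setting the working directory below), with fileurl as a hypothetical name:

# Download the dataset if it is not already present (sketch)
fileurl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(fileurl, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}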

# Set working directory
setwd("/Users/ltse/Documents/Personal/Coursera/course10")
# Unzip data
unzip("Coursera-SwiftKey.zip")
# Set constants
inputfile1 <- "./final/en_US/en_US.blogs.txt"
inputfile2 <- "./final/en_US/en_US.news.txt"
inputfile3 <- "./final/en_US/en_US.twitter.txt"
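
The packages whose functions are used throughout this report are loaded up front (the list below is inferred from the calls that follow):

# Load the packages used in the analysis
library(stringi)      # stri_count_words()
library(tm)           # VCorpus(), tm_map(), stopwords(), ...
library(dplyr)        # %>%, count(), arrange()
library(tidytext)     # unnest_tokens()
library(ggplot2)      # bar plots
library(wordcloud)    # wordcloud()
library(RColorBrewer) # brewer.pal()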

The data consists of a “final” folder containing subfolders for four locales, each with three text data files. For this course, the English (en_US) database was used; it contains three text files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.

Initial quick inspection of the en_US.twitter.txt file revealed heavy use of emoticons. In preparation for modelling, all non-ASCII characters should be removed.
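
A quick peek of this kind can also be scripted; the sketch below reads only the first few thousand lines of the Twitter file and counts how many contain non-ASCII characters (the 5000-line cutoff is illustrative):

# Count how many of the first lines contain non-ASCII characters (sketch)
peek <- readLines(inputfile3, n = 5000, skipNul = TRUE)
sum(grepl("[^\x01-\x7F]", peek))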

The next steps were to gather high-level facts about the three files and then perform data cleaning.

Getting Data

set.seed(12345) # Fix the random seed so the line sampling below is reproducible

con <- file(inputfile1, "r") # Initiate a connection
blogs <- readLines(con) # Read all lines in the text file
close(con) # Close the connection

con <- file(inputfile2, "r") # Initiate a connection
news <- readLines(con) # Read all lines in the text file
close(con) # Close the connection

con <- file(inputfile3, "r") # Initiate a connection
twitter <- readLines(con) # Read all lines in the text file
## Warning in readLines(con): line 167155 appears to contain an embedded nul
## Warning in readLines(con): line 268547 appears to contain an embedded nul
## Warning in readLines(con): line 1274086 appears to contain an embedded nul
## Warning in readLines(con): line 1759032 appears to contain an embedded nul
close(con) # Close the connection
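
The embedded-nul warnings above come from stray NUL bytes in the Twitter file and are harmless here, but readLines() can also drop NULs silently; a possible variant (a sketch, not what was run above):

# Re-read the Twitter file while silently dropping embedded NUL bytes (sketch)
con <- file(inputfile3, "r")
twitter <- readLines(con, skipNul = TRUE)
close(con)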

# Gather high-level facts about each text file
# (object.size() reports the in-memory size of the character vector, not the on-disk file size)
filesize_blogs <- format(object.size(blogs), units = "Mb")
numlines_blogs <- length(blogs)
numwords_blogs <- sum(stri_count_words(blogs))

filesize_news <- format(object.size(news), units = "Mb")
numlines_news <- length(news)
numwords_news <- sum(stri_count_words(news))

filesize_twitter <- format(object.size(twitter), units = "Mb")
numlines_twitter <- length(twitter)
numwords_twitter <- sum(stri_count_words(twitter))

Quick facts about the three text files (the in-memory size, number of lines, and word count computed above):

en_US.blogs.txt - filesize_blogs, numlines_blogs lines, numwords_blogs words

en_US.news.txt - filesize_news, numlines_news lines, numwords_news words

en_US.twitter.txt - filesize_twitter, numlines_twitter lines, numwords_twitter words
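
These facts could equally be gathered into a single summary table; a minimal sketch using the variables computed above (summary_df is a hypothetical name):

# Assemble the high-level facts into one data frame (sketch)
summary_df <- data.frame(
    file  = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
    size  = c(filesize_blogs, filesize_news, filesize_twitter),
    lines = c(numlines_blogs, numlines_news, numlines_twitter),
    words = c(numwords_blogs, numwords_news, numwords_twitter)
)
summary_df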

Cleaning Data

The text files are very large. For modelling purposes, a smaller training subset was created from each. Sampling even 2% of the lines exhausted vector memory, so 1.5% of the lines were used instead (selected via binomial sampling).

Standard cleaning was then applied to each training set: stripping extra whitespace, converting all text to lowercase, removing stop words, removing punctuation, and stemming words to reduce the corpus size.

# Sample a subset of each text file for modelling:
# each line is kept independently with probability 1.5% (binomial sampling)
train_blogs <- blogs[as.logical(rbinom(numlines_blogs, 1, 0.015))]
train_news <- news[as.logical(rbinom(numlines_news, 1, 0.015))]
train_twitter <- twitter[as.logical(rbinom(numlines_twitter, 1, 0.015))]

# Cleaning

# Remove non-ASCII characters
train_blogs <- gsub("[^\x01-\x7F]", "", train_blogs) 
train_news <- gsub("[^\x01-\x7F]", "", train_news) 
train_twitter <- gsub("[^\x01-\x7F]", "", train_twitter) 

# Use tm package to perform extra text cleaning
corpus_blogs <- VCorpus(VectorSource(train_blogs)) %>% # Convert to a corpus
    tm_map(content_transformer(tolower)) %>% # Change to all lowercase
    tm_map(stripWhitespace) %>% # Strip all double (or more) spaces
    tm_map(removePunctuation) %>% # Remove everything but letters and numbers
    tm_map(removeWords, stopwords("english")) %>% # Remove unhelpful words (e.g.: i, is, me, our, at)
    tm_map(stemDocument) # Reduce word variations by removing words with common stems

# Repeat the same cleaning pipeline for the news and twitter corpora
corpus_news <- VCorpus(VectorSource(train_news)) %>% # Convert to a corpus
    tm_map(content_transformer(tolower)) %>% # Change to all lowercase
    tm_map(stripWhitespace) %>% # Strip all double (or more) spaces
    tm_map(removePunctuation) %>% # Remove everything but letters and numbers
    tm_map(removeWords, stopwords("english")) %>% # Remove unhelpful words (e.g.: i, is, me, our, at)
    tm_map(stemDocument) # Reduce word variations by removing words with common stems

corpus_twitter <- VCorpus(VectorSource(train_twitter)) %>% # Convert to a corpus
    tm_map(content_transformer(tolower)) %>% # Change to all lowercase
    tm_map(stripWhitespace) %>% # Strip all double (or more) spaces
    tm_map(removePunctuation) %>% # Remove everything but letters and numbers
    tm_map(removeWords, stopwords("english")) %>% # Remove unhelpful words (e.g.: i, is, me, our, at)
    tm_map(stemDocument) # Reduce word variations by removing words with common stems
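
The same pipeline is applied three times above; it could equally be wrapped in a small helper (a sketch, with clean_corpus() as a hypothetical name):

# Hypothetical helper wrapping the repeated cleaning pipeline (sketch)
clean_corpus <- function(text) {
    VCorpus(VectorSource(text)) %>%
        tm_map(content_transformer(tolower)) %>%
        tm_map(stripWhitespace) %>%
        tm_map(removePunctuation) %>%
        tm_map(removeWords, stopwords("english")) %>%
        tm_map(stemDocument)
}
# e.g. corpus_blogs <- clean_corpus(train_blogs)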

Exploratory Analysis

After text cleaning, a term-document matrix was generated for each corpus, from which word frequencies were derived. The 30 most frequent words and their frequencies are shown in a bar plot for each text file.

# Generate a term-document matrix for each corpus
dtm_blogs <- TermDocumentMatrix(corpus_blogs)
dtm_news <- TermDocumentMatrix(corpus_news)
dtm_twitter <- TermDocumentMatrix(corpus_twitter)

# Sum term frequencies across all documents and sort in descending order
dtm_blogs_matrix <- sort(rowSums(as.matrix(dtm_blogs)), decreasing = TRUE)
dtm_news_matrix <- sort(rowSums(as.matrix(dtm_news)), decreasing = TRUE)
dtm_twitter_matrix <- sort(rowSums(as.matrix(dtm_twitter)), decreasing = TRUE)
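
Calling as.matrix() on a term-document matrix creates a dense matrix, which can be memory-hungry for larger samples. Summing over the sparse representation directly is a lighter alternative; a sketch, assuming the slam package that tm builds on (freq_blogs is a hypothetical name):

# Memory-friendlier alternative: sum term frequencies on the sparse matrix (sketch)
library(slam)
freq_blogs <- sort(row_sums(dtm_blogs), decreasing = TRUE)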

# Generate a data frame of just the words and their frequencies
df_blogs <- data.frame(word = names(dtm_blogs_matrix), freq = dtm_blogs_matrix)
df_blogs$cumsum <- cumsum(df_blogs$freq)
df_blogs_unique_words <- nrow(df_blogs)
df_blogs_num_words <- sum(df_blogs$freq)
df_blogs_unique_50 <- min(which(df_blogs$cumsum > 0.5*df_blogs_num_words))
df_blogs_unique_90 <- min(which(df_blogs$cumsum > 0.9*df_blogs_num_words))

df_news <- data.frame(word = names(dtm_news_matrix), freq = dtm_news_matrix)
df_news$cumsum <- cumsum(df_news$freq)
df_news_unique_words <- nrow(df_news)
df_news_num_words <- sum(df_news$freq)
df_news_unique_50 <- min(which(df_news$cumsum > 0.5*df_news_num_words))
df_news_unique_90 <- min(which(df_news$cumsum > 0.9*df_news_num_words))

df_twitter <- data.frame(word = names(dtm_twitter_matrix), freq = dtm_twitter_matrix)
df_twitter$cumsum <- cumsum(df_twitter$freq)
df_twitter_unique_words <- nrow(df_twitter)
df_twitter_num_words <- sum(df_twitter$freq)
df_twitter_unique_50 <- min(which(df_twitter$cumsum > 0.5*df_twitter_num_words))
df_twitter_unique_90 <- min(which(df_twitter$cumsum > 0.9*df_twitter_num_words))
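
The 50%/90% coverage computation repeats the same pattern three times; a small helper generalizes it to any coverage fraction (coverage_words() is a hypothetical name):

# Number of top-frequency unique words needed to cover a fraction p of all word instances (sketch)
coverage_words <- function(df, p) {
    min(which(cumsum(df$freq) > p * sum(df$freq)))
}
# e.g. coverage_words(df_blogs, 0.5); coverage_words(df_blogs, 0.9)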

# Plot the words and their frequencies
ggplot(df_blogs[1:30,], aes(x = reorder(word,freq), y = freq)) +
    geom_bar(stat='identity') + 
    labs(title="en_US.blogs.txt - Word Frequencies (Top 30)", 
         x ="Word", y = "Frequency") +
    coord_flip() 

ggplot(df_news[1:30,], aes(x = reorder(word,freq), y = freq)) +
    geom_bar(stat='identity') + 
    labs(title="en_US.news.txt - Word Frequencies (Top 30)", 
         x ="Word", y = "Frequency") +
    coord_flip() 

ggplot(df_twitter[1:30,], aes(x = reorder(word,freq), y = freq)) +
    geom_bar(stat='identity') + 
    labs(title="en_US.twitter.txt - Word Frequencies (Top 30)", 
         x ="Word", y = "Frequency") +
    coord_flip() 

The sampled en_US.blogs.txt corpus contains a total of 285,677 words, of which 29,323 are unique. To cover 50% and 90% of all word instances, 574 and 7,400 unique words, respectively, are needed.

The sampled en_US.news.txt corpus contains a total of 285,029 words, of which 29,660 are unique. To cover 50% and 90% of all word instances, 591 and 7,604 unique words, respectively, are needed.

The sampled en_US.twitter.txt corpus contains a total of 283,858 words, of which 26,166 are unique. To cover 50% and 90% of all word instances, 548 and 6,451 unique words, respectively, are needed.

Below are word cloud representations of the three text files.

wordcloud(words = df_blogs$word, 
          freq = df_blogs$freq,
          min.freq = 1,
          max.words = 100,
          random.order = FALSE,
          rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))

wordcloud(words = df_news$word, 
          freq = df_news$freq,
          min.freq = 1,
          max.words = 100,
          random.order = FALSE,
          rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))

wordcloud(words = df_twitter$word, 
          freq = df_twitter$freq,
          min.freq = 1,
          max.words = 100,
          random.order = FALSE,
          rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))

Next, n-gram analysis was performed. The following plots show bigram and trigram distributions for the three text files.

# en_US.blogs.txt
ngram_df_blogs <- data.frame(text = sapply(corpus_blogs, as.character), 
                             stringsAsFactors = FALSE)
bigramtoken_blogs <- ngram_df_blogs %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    count(bigram, sort = TRUE) %>%
    arrange(desc(n))
#head(bigramtoken_blogs)
trigramtoken_blogs <- ngram_df_blogs %>%
    unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
    count(trigram, sort = TRUE) %>%
    arrange(desc(n))

# en_US.news.txt
ngram_df_news <- data.frame(text = sapply(corpus_news, as.character), 
                             stringsAsFactors = FALSE)
bigramtoken_news <- ngram_df_news %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    count(bigram, sort = TRUE) %>%
    arrange(desc(n))
#head(bigramtoken_news)
trigramtoken_news <- ngram_df_news %>%
    unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
    count(trigram, sort = TRUE) %>%
    arrange(desc(n))

# en_US.twitter.txt
ngram_df_twitter <- data.frame(text = sapply(corpus_twitter, as.character), 
                             stringsAsFactors = FALSE)
bigramtoken_twitter <- ngram_df_twitter %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    count(bigram, sort = TRUE) %>%
    arrange(desc(n))
#head(bigramtoken_twitter)
trigramtoken_twitter <- ngram_df_twitter %>%
    unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
    count(trigram, sort = TRUE) %>%
    arrange(desc(n))

# Plotting the n-grams
# Note: par(mfrow = c(1, 2)) has no effect on ggplot2 output, so each plot
# is rendered on its own (see the gridExtra sketch after these plots)
ggplot(bigramtoken_blogs[1:30,], aes(x = reorder(bigram,n), y = n)) +
    geom_bar(stat='identity') + 
    labs(title="en_US.blogs.txt - Bigram (Top 30)", 
         x ="Bigram", y = "Frequency") +
    coord_flip()

ggplot(trigramtoken_blogs[1:30,], aes(x = reorder(trigram,n), y = n)) + 
    geom_bar(stat='identity') +
    labs(title="en_US.blogs.txt - Trigram (Top 30)", 
         x ="Trigram", y = "Frequency") +
    coord_flip()

ggplot(bigramtoken_news[1:30,], aes(x = reorder(bigram,n), y = n)) +
    geom_bar(stat='identity') + 
    labs(title="en_US.news.txt - Bigram (Top 30)", 
         x ="Bigram", y = "Frequency") + 
    coord_flip()

ggplot(trigramtoken_news[1:30,], aes(x = reorder(trigram,n), y = n)) + 
    geom_bar(stat='identity') + 
    labs(title="en_US.news.txt - Trigram (Top 30)", 
         x ="Trigram", y = "Frequency") + 
    coord_flip()

ggplot(bigramtoken_twitter[1:30,], aes(x = reorder(bigram,n), y = n)) +
    geom_bar(stat='identity') + 
    labs(title="en_US.twitter.txt - Bigram (Top 30)", 
         x ="Bigram", y = "Frequency") + 
    coord_flip()

ggplot(trigramtoken_twitter[1:30,], aes(x = reorder(trigram,n), y = n)) + 
    geom_bar(stat='identity') + 
    labs(title="en_US.twitter.txt - Trigram (Top 30)", 
         x ="Trigram", y = "Frequency") + 
    coord_flip()
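
Since par(mfrow = ...) does not apply to ggplot2 output, each of the plots above is rendered on its own. If side-by-side bigram/trigram panels are wanted, a grid helper can be used instead; a sketch, assuming the gridExtra package:

# Place two ggplot objects side by side (sketch)
library(gridExtra)
p_bigram <- ggplot(bigramtoken_blogs[1:30,], aes(x = reorder(bigram, n), y = n)) +
    geom_bar(stat = 'identity') + coord_flip() +
    labs(title = "en_US.blogs.txt - Bigram (Top 30)", x = "Bigram", y = "Frequency")
p_trigram <- ggplot(trigramtoken_blogs[1:30,], aes(x = reorder(trigram, n), y = n)) +
    geom_bar(stat = 'identity') + coord_flip() +
    labs(title = "en_US.blogs.txt - Trigram (Top 30)", x = "Trigram", y = "Frequency")
grid.arrange(p_bigram, p_trigram, ncol = 2)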

Observations and Future Steps

From the exploratory analysis above, the ten most frequent words are very similar across the three text files. Another interesting observation is that quite a few misspellings appear among the top word combinations, which indicates that further text cleaning may be necessary.

With the n-gram analysis results in hand, the next step is to construct prediction models. Recall that these n-grams were built from only a small subset of the input text files; the resulting model will be tested on the remaining, unused portion of the files, after which further tuning may be needed.
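
As a purely illustrative sketch of where this is heading (not the final model), the bigram counts above already support a crude next-word lookup. The helper below, its name predict_next(), and the tidyr dependency are assumptions for this example:

# Crude next-word lookup from the bigram counts (illustrative sketch only)
library(tidyr)
predict_next <- function(prev_word, bigrams, k = 3) {
    bigrams %>%
        separate(bigram, into = c("w1", "w2"), sep = " ") %>% # split "w1 w2" into two columns
        filter(w1 == prev_word) %>%                           # keep bigrams starting with prev_word
        head(k) %>%                                           # counts are already sorted descending
        pull(w2)                                              # return the top-k continuations
}
# e.g. predict_next("last", bigramtoken_blogs)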

The final model will then be incorporated into a Shiny app that takes a text string from the user as input and returns candidate word choice(s) from the prediction model.