Background

In this report, I conduct an exploratory analysis of the U.S. English news, blogs, and Twitter text files. The data can be downloaded here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Below, I will show how I loaded the packages and data, and how I cleaned the data. Then, I will present a basic summary of the data (line and word count by source), followed by plots showing the most frequent unigrams, bigrams, and trigrams. Lastly, I will briefly share my plan for creating a prediction algorithm and Shiny app.

Load Packages and Data

library(stringr)
library(ggplot2)
library(tidytext)
library(dplyr)
library(tibble)
library(tidyr)
library(textclean)

blogs <- readLines("en_US.blogs.txt", warn = FALSE)
news <- readLines("en_US.news.txt", warn = FALSE)
twitter <- readLines("en_US.twitter.txt", warn = FALSE)
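As a hedged aside on loading: stray control characters or embedded NUL bytes in large text files can, on some platforms, cause a plain text-mode read to stop early or emit warnings. The sketch below shows one defensive way to read such a file through a binary connection. The helper name `read_corpus` is hypothetical and not part of the original analysis, and the temporary file only stands in for the real data files.

```r
# Hypothetical helper (an assumption, not the report's own code): read a text
# file through an explicit binary connection and drop embedded NUL bytes.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Usage on a small temporary file standing in for en_US.blogs.txt
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
demo_lines <- read_corpus(tmp)
length(demo_lines)
```

If the line counts from this helper matched those from a plain `readLines()` call, the simpler call would be sufficient.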

Cleaning the Data

Because the text files are very large, I will work with a subset made up of the first 25,000 lines from each source, which keeps R processing times manageable during the exploratory analysis. I will then convert all letters to lowercase to reduce duplicate tokens caused by capitalization, and expand contractions using the “textclean” package. Punctuation and numbers are removed later, during tokenization.

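# Combine the first 25,000 lines of each source into one character vector
subset_data <- c(blogs[1:25000], news[1:25000], twitter[1:25000])
# Lowercase everything, then expand contractions (e.g., "don't" to "do not")
subset_data <- tolower(subset_data)
subset_data <- replace_contraction(subset_data)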
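The subset above takes the first 25,000 lines of each file. If the files happen to be ordered in some way, a random sample may be more representative. The following is a minimal sketch of that alternative, not what this report uses. The helper `sample_lines`, the seed, and the toy vector are all assumptions made for illustration.

```r
set.seed(2024)  # arbitrary seed, chosen only so the sample is reproducible

# Hypothetical helper: draw up to n lines at random, without replacement
sample_lines <- function(x, n) {
  x[sample.int(length(x), size = min(n, length(x)))]
}

# Toy stand-in for one of the source vectors (blogs, news, or twitter)
toy_source <- paste("line", 1:100)
toy_subset <- sample_lines(toy_source, 10)
length(toy_subset)
```

With the real data, the same helper could be applied to each source before combining them, for example `sample_lines(blogs, 25000)`.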

Basic Summaries of the Three Files

Next, I will present a table and plots showing the number of lines and words in each source:

Data Summary Table:

DataSummary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(str_count(blogs, "\\S+")),
    sum(str_count(news, "\\S+")),
    sum(str_count(twitter, "\\S+"))
  )
)

DataSummary
##    Source   Lines    Words
## 1   Blogs  899288 37334131
## 2    News 1010206 34371031
## 3 Twitter 2360148 30373543

Word Count in each Source Plot:

ggplot(DataSummary, aes(x = Source, y = Words, fill = Source)) +
  geom_bar(stat = "identity") +
  labs(title = "Word Count by Source", x = "Source", y = "Number of Words") +
  theme_minimal()

Line Count in each Source Plot:

ggplot(DataSummary, aes(x = Source, y = Lines, fill = Source)) +
  geom_bar(stat = "identity") +
  labs(title = "Line Count by Source", x = "Source", y = "Number of Lines") +
  theme_minimal()

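From the above table and plots, we can see that blogs have the most words (about 37.3 million), followed by news (about 34.4 million) and Twitter (about 30.4 million). Conversely, Twitter has the most lines (about 2.36 million), followed by news (about 1.01 million) and blogs (about 0.90 million).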
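One way to make this contrast concrete is to divide words by lines. The short sketch below recomputes that ratio from the figures in the summary table. The column name `WordsPerLine` is introduced here for illustration only. On these figures, blog lines average roughly 41.5 words, news lines roughly 34.0, and tweets roughly 12.9.

```r
# Figures copied from the summary table in this report
summary_tbl <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines  = c(899288, 1010206, 2360148),
  Words  = c(37334131, 34371031, 30373543)
)

# Average number of words per line, rounded to one decimal place
summary_tbl$WordsPerLine <- round(summary_tbl$Words / summary_tbl$Lines, 1)
summary_tbl
```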

Features of the Data

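Here I will present the 10 most common unigrams, bigrams, and trigrams in my subset. Punctuation is dropped during tokenization, and tokens containing numbers or other non-letter characters are filtered out. Stop words (e.g., “the”, “and”, “of”, “to”, “is”, “a”, and “in”) are removed from the unigrams and bigrams but kept in the trigrams.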

Most Common Unigrams (Single Words):

Creating the Unigrams:

data("stop_words")
tokens <- tibble(text = subset_data) %>%
  unnest_tokens(word, text) %>%
  filter(str_detect(word, "^[a-z]+$")) %>%
  anti_join(stop_words, by = "word")
word_freq <- tokens %>%
  count(word, sort = TRUE)

Plot of the Most Common Unigrams:

word_freq %>%
  slice_max(n, n = 10) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "red") +
  coord_flip() +
  labs(title = "Top 10 Most Common Unigrams", x = "Unigrams", y = "Count") +
  theme_minimal()

Thus, we can see that the most common unigrams are: “time”, “people”, “day”, “love”, “life”, “home”, “week”, “school”, “world” and “game”.

Most Common Bigrams (2-Grams):

Creating the Bigrams:

bigrams <- tibble(text = subset_data) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(str_detect(word1, "^[a-z]+$"),
         str_detect(word2, "^[a-z]+$")) %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ") %>%
  count(bigram, sort = TRUE)

Plot of the Most Common Bigrams:

bigrams %>%
  slice_max(n, n = 10) %>%
  ggplot(aes(x = reorder(bigram, n), y = n)) +
  geom_col(fill = "blue") +
  coord_flip() +
  labs(title = "Top 10 Most Common Bigrams", x = "Bigrams", y = "Count") +
  theme_minimal()

Thus, we can see that the most common bigrams are: “st louis”, “los angeles”, “san francisco”, “health care”, “san diego”, “happy birthday”, “social media”, “ice cream”, “white house”, and “weeks ago”.

Most Common Trigrams (3-Grams):

Creating the Trigrams:

# Stop words are kept here, so phrases built from common function words remain
trigrams <- tibble(text = subset_data) %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(str_detect(word1, "^[a-z]+$"),
         str_detect(word2, "^[a-z]+$"),
         str_detect(word3, "^[a-z]+$")) %>%
  unite(trigram, word1, word2, word3, sep = " ") %>%
  count(trigram, sort = TRUE)

Plot of the Most Common Trigrams:

trigrams %>%
  slice_max(n, n = 10) %>%
  ggplot(aes(x = reorder(trigram, n), y = n)) +
  geom_col(fill = "green") +
  coord_flip() +
  labs(title = "Top 10 Most Common Trigrams", x = "Trigrams", y = "Count") +
  theme_minimal()

Thus, we can see that the most common trigrams are: “one of the”, “i do not”, “a lot of”, “it is a”, “i am not”, “it is not”, “i have been”, “there is a”, “i can not”, and “the end of”.

Next Steps

The next steps for creating my prediction algorithm and Shiny app involve constructing n-gram language models from my cleaned corpus. I will build six models to predict the next word, ranging from 6-grams down to 5-grams, 4-grams, 3-grams, 2-grams, and 1-grams. These will be combined in a backoff approach: the prediction uses the highest-order n-gram that matches the input and, when no match is found, backs off to progressively lower-order n-grams.
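The backoff idea can be sketched with a toy lookup table. This is a minimal illustration under stated assumptions, not the planned implementation: the table contents are invented, and the helper `predict_next` and its `max_order` argument are hypothetical names. It tries the longest available context first and shortens it one word at a time until a match is found.

```r
# Toy n-gram table: each row maps a context (the preceding words) to a next
# word. A real table would be built from the cleaned corpus.
ngram_table <- data.frame(
  context = c("one of", "of", "a lot", "lot", ""),
  nxt     = c("the", "the", "of", "of", "the"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: try the longest context first, then back off
predict_next <- function(phrase, ngrams, max_order = 6) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  # A k-gram model conditions on the previous k - 1 words
  longest <- min(length(words), max_order - 1)
  for (k in seq(longest, 0)) {
    context <- paste(tail(words, k), collapse = " ")
    hit <- ngrams$nxt[ngrams$context == context]
    if (length(hit) > 0) return(hit[1])
  }
  NA_character_
}

predict_next("She is one of", ngram_table)   # matches the context "one of"
predict_next("I bought a lot", ngram_table)  # matches the context "a lot"
predict_next("Hello there", ngram_table)     # backs off to the empty context
```

The empty-string context plays the role of the 1-gram model here: it supplies a default word when nothing longer matches.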

Because many possible word combinations may not appear in the dataset, I will also explore smoothing techniques to help assign small probabilities to unseen word sequences and improve the robustness of the predictions. I may revise the number of n-gram models I use depending on predictive accuracy and how long it takes to run. My goal is to reduce memory usage as much as possible and run efficiently without significantly reducing the performance of the model.
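The report does not yet commit to a particular smoothing technique. As one simple possibility, the sketch below illustrates add-k (Laplace) smoothing, where a constant k is added to every count so that unseen words still receive a small, non-zero probability. The counts, the vocabulary, and the helper name `add_k_prob` are invented for illustration.

```r
# Toy bigram counts for the context word "ice"; the numbers are invented
bigram_counts <- c(cream = 8, cold = 2)
vocabulary <- c("cream", "cold", "skating", "the")  # toy vocabulary, V = 4

# Hypothetical helper: add-k smoothed probability of each vocabulary word,
# computed as (count + k) / (total count + k * V)
add_k_prob <- function(counts, vocab, k = 1) {
  full <- setNames(numeric(length(vocab)), vocab)
  full[names(counts)] <- counts
  (full + k) / (sum(full) + k * length(vocab))
}

probs <- add_k_prob(bigram_counts, vocabulary)
probs
sum(probs)
```

With k = 1, the unseen words “skating” and “the” each receive probability 1/14 instead of zero, and the probabilities still sum to 1.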

I will also need to experiment with the size of the dataset used to build my model. Because I could not process the entire dataset for this exploratory analysis, I arbitrarily chose a subset of 25,000 lines from each source. This may not be enough to make accurate predictions, so I will increase the number if computational resources allow. The dataset will then be randomly divided into training, validation, and test sets using a 60/30/10 split in order to evaluate the model’s performance.
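A random 60/30/10 split can be sketched in base R by shuffling the line indices once and cutting them into three blocks. The toy `corpus` vector and the seed below are assumptions for illustration. The real input would be the cleaned text.

```r
set.seed(123)  # arbitrary seed for a reproducible split

# Toy stand-in for the cleaned corpus
corpus <- paste("line", 1:1000)

n <- length(corpus)
idx <- sample.int(n)          # random permutation of the line indices
n_train <- round(0.6 * n)     # 60% for training
n_valid <- round(0.3 * n)     # 30% for validation; the rest is the test set

train      <- corpus[idx[seq_len(n_train)]]
validation <- corpus[idx[n_train + seq_len(n_valid)]]
test       <- corpus[idx[(n_train + n_valid + 1):n]]

c(train = length(train), validation = length(validation), test = length(test))
```

Because each index appears exactly once in the permutation, no line ends up in more than one set.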

Finally, I will create a Shiny application with my predictive model. I want this to be very user-friendly and simple. There will be a text box for users to input a word or phrase, and underneath they will receive real-time next-word predictions based on my model.
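A minimal sketch of such an app is shown below, assuming the “shiny” package is installed. The function `predict_next_word` is a hypothetical placeholder that always returns the same word. In the real app it would be replaced by a call to the n-gram model. The input and output IDs are likewise illustrative.

```r
library(shiny)

# Placeholder predictor: a real app would call the n-gram model instead
predict_next_word <- function(phrase) {
  if (!nzchar(trimws(phrase))) return("")
  "the"  # dummy prediction, for illustration only
}

ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Enter a word or phrase:"),
  textOutput("prediction")
)

server <- function(input, output, session) {
  # renderText re-runs whenever input$phrase changes
  output$prediction <- renderText({
    predict_next_word(input$phrase)
  })
}

app <- shinyApp(ui, server)
# runApp(app)  # uncomment to launch the app locally
```

Because `renderText()` is reactive, the displayed prediction updates as the user types, which matches the real-time behavior described above.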