0. Understanding the Problem

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain.

We will use all of the skills we have learned during the Data Science Specialization to tackle this problem: analysis of text data and natural language processing.

The data utilized is available from:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

This link contains data from 3 sources:

  • en_US.blogs.txt
  • en_US.news.txt
  • en_US.twitter.txt

1. Getting and Cleaning the Data

We must begin by downloading, sampling and cleaning the data.

Setup

Prepare the environment by loading the necessary packages.

library(readr); library(stringi); library(knitr); library(kableExtra)
library(tidytext); library(dplyr); library(ggplot2)

Loading the Data

Download the data, if necessary, after checking whether the data directory already exists.

fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
fileName <- "data/Coursera-Swiftkey.zip"

if (!file.exists("data")) {
        dir.create("data")
        download.file(fileURL, destfile = fileName, method = "curl")
        dateDownloaded <- date()
        unzip(fileName, exdir = "data")
}

fileBlogs <- "data/final/en_US/en_US.blogs.txt"
fileNews <- "data/final/en_US/en_US.news.txt"
fileTwitter <- "data/final/en_US/en_US.twitter.txt"

Blogs_data <- read_lines(fileBlogs)      
News_data <- read_lines(fileNews)       
Twitter_data <- read_lines(fileTwitter)

Data Summary

After downloading the data, we explore the macro summary statistics of the data. These statistics include: file size, number of lines per file, total characters per file, maximum characters per line, total words per file, and maximum, minimum and average words per line.

file_size <- round(file.info(c(fileBlogs, fileNews, fileTwitter))$size / 1024 ^ 2, 2)
lines_file <- sapply(list(Blogs_data, News_data, Twitter_data), length)
char_file <- sapply(list(nchar(Blogs_data), nchar(News_data), nchar(Twitter_data)), sum)

char_line_max <- sapply(list(nchar(Blogs_data), nchar(News_data), nchar(Twitter_data)), max)
words_line <- lapply(list(Blogs_data, News_data, Twitter_data), 
                     function(x) stri_count_words(x))

words_line_sum <- sapply(words_line, sum)
words_line_max <- sapply(words_line, max)
words_line_min <- sapply(words_line, min)
words_line_mean <- sapply(words_line, mean)

sum_stats <- data.frame(
        file = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
        Size_mb = file_size,
        Lines = lines_file,
        Total_Char = char_file,
        char_max = char_line_max,
        Total_words = words_line_sum,
        Words_max = words_line_max,
        words_min = words_line_min,
        words_mean = words_line_mean
)

kable(sum_stats,
      row.names = FALSE) %>%
        kable_styling(position = "center")
file                 Size_mb    Lines   Total_Char   char_max   Total_words   Words_max   words_min   words_mean
en_US.blogs.txt       200.42   899288    206824505      40833      37546239        6726           0     41.75107
en_US.news.txt        196.28  1010242    203223159      11384      34762395        1796           1     34.40997
en_US.twitter.txt     159.36  2360148    162096031        140      30093372          47           1     12.75063

Sampling the Data

Due to the large size of these text files and anticipated processing constraints, it is preferable to work with a sample of the overall data. We will draw a 1% sample from each source and combine the samples into a unified data set.

Additionally, we will write this to an .rds file for ease of use going forward.

set.seed(4242)

samp_rate <- 0.01

twit <- sample(Twitter_data, samp_rate * length(Twitter_data), replace = FALSE)
blog <- sample(Blogs_data, samp_rate * length(Blogs_data), replace = FALSE)
news <- sample(News_data, samp_rate * length(News_data), replace = FALSE)

samp <- c(twit, blog, news)

# saveRDS(samp, "data/sample.rds")

Sample Summary

Below are the summary statistics for the combined sample data (1% from each data source).

sample_stats <- data.frame(
        Lines = length(samp),
        Total_Char = sum(nchar(samp)),
        char_max = max(nchar(samp)),
        Total_words = sum(stri_count_words(samp)),
        Words_max = max(stri_count_words(samp)),
        words_min = min(stri_count_words(samp)),
        words_mean = mean(stri_count_words(samp))
)

kable(sample_stats,
      row.names = FALSE) %>%
        kable_styling(position = "center")
Lines   Total_Char   char_max   Total_words   Words_max   words_min   words_mean
42695      5709616       4736       1022714         819           1     23.95395

Before we begin cleaning the data, we will remove some large objects to free up memory.

rm(Blogs_data, News_data, Twitter_data, twit, blog, news, words_line)

Cleaning the Data

We will use the functionality of tidytext to clean and tokenize the data. The unnest_tokens function cleans the data by stripping white space, removing punctuation, and converting all words to lower case.

We have chosen to retain “stopwords”, as we are attempting to predict speech patterns; removing them would result in nonsensical phrases.

Tokenization

We will create separate data frames for unigrams, bigrams and trigrams.

sample_df <- tibble(text = samp)

unigrams <- sample_df %>%
  unnest_tokens(unigram, text)

bigrams <- sample_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

trigrams <- sample_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)

Profanity Filtering

We will remove profanity and other words we do not want to predict, as sketched below.
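
A minimal sketch of one way to do this, assuming a character vector profanity holds the terms to exclude (the placeholder terms and the commented file path below are hypothetical, not part of the project data):

# Hypothetical profanity list; substitute any published word list.
# profanity <- read_lines("data/profanity_list.txt")
profanity <- c("badword1", "badword2")

# Drop profane unigrams, and any bigram or trigram containing a profane term.
profanity_pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")

unigrams <- unigrams %>% filter(!unigram %in% profanity)
bigrams <- bigrams %>% filter(!grepl(profanity_pattern, bigram))
trigrams <- trigrams %>% filter(!grepl(profanity_pattern, trigram))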

2. Exploratory Data Analysis

The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships we observe in the data and prepare to build our first linguistic models.

Distribution of Words

We perform a thorough exploratory analysis of the data; understanding the distribution of words and the relationships between words in the corpora is important.

Here is a visual of the most frequent words in the sample.

unigrams %>%
  count(unigram, sort = TRUE) %>%
  filter(n > 10000) %>%
  mutate(unigram = reorder(unigram, n)) %>%
  ggplot(aes(unigram, n, fill = unigram)) +
  geom_col() +
  coord_flip() +
  theme(legend.position = "none")

Relationship between words

Let’s also visualize the most popular bigrams and trigrams.

bigrams %>%
  count(bigram, sort = TRUE) %>%
  filter(n > 1000) %>%
  mutate(bigram = reorder(bigram, n)) %>%
  ggplot(aes(bigram, n, fill = bigram)) +
  geom_col() +
  coord_flip() +
  theme(legend.position = "none")

trigrams %>%
  count(trigram, sort = TRUE) %>%
  filter(n > 150) %>%
  mutate(trigram = reorder(trigram, n)) %>%
  ggplot(aes(trigram, n, fill = trigram)) +
  geom_col() +
  coord_flip() +
  theme(legend.position = "none")

Analysis for efficiency

We need to analyze the data sets to identify where we can trim the data for speed and efficiency.

unigrams_n <- unigrams %>%
        count(unigram, sort = TRUE) %>%
        mutate(cum_sum_pct = cumsum(n)/sum(n))

bigrams_n <- bigrams %>%
        count(bigram, sort = TRUE) %>%
        mutate(cum_sum_pct = cumsum(n)/sum(n))

trigrams_n <- trigrams %>%
        count(trigram, sort = TRUE) %>%
        mutate(cum_sum_pct = cumsum(n)/sum(n))

We will calculate the total number of unique n-grams, and the proportion of them needed to cover 90% of all n-gram instances.

c(nrow(unigrams_n), nrow(bigrams_n), nrow(trigrams_n))
## [1]  55485 487831 874340
c(mean(unigrams_n$cum_sum_pct < 0.9), mean(bigrams_n$cum_sum_pct < 0.9), mean(trigrams_n$cum_sum_pct < 0.9))
## [1] 0.1366495 0.7903536 0.8830295

Here we see that the number of unique n-grams increases dramatically as n increases.

We will make two attempts to trim the data sets to increase speed and efficiency.

  1. Filter each data set so that 90% of the n-gram instances are covered. For instance, we can remove over 86% of the unique unigrams with only a minor effect on precision.
  2. Remove excess n-gram candidates. For instance, if we propose three predictions for the ‘next word’, we only need to keep the top three continuations for each n-gram prefix. Example: to complete the phrase ‘of the ___’ we may have many unique trigrams built on the 4,000+ occurrences of the bigram ‘of the’; however, we only need the top three to produce a prediction for the user (see the sketch after this list).
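
A minimal sketch of both trimming steps for the trigram table, assuming the trigrams_n data frame from the chunk above; the column names prefix and next_word are illustrative, not final:

library(tidyr)

trigrams_trim <- trigrams_n %>%
        filter(!is.na(trigram), cum_sum_pct <= 0.9) %>%   # keep rows covering ~90% of instances
        separate(trigram, into = c("w1", "w2", "w3"), sep = " ") %>%
        unite(prefix, w1, w2, sep = " ") %>%
        rename(next_word = w3) %>%
        group_by(prefix) %>%
        slice_max(n, n = 3, with_ties = FALSE) %>%        # keep only the top 3 candidates per prefix
        ungroup()

The same steps apply to the bigram table, with a single-word prefix.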

These implementations will be key to making the final product efficient while remaining precise.

Future Steps to Implement

We will implement a backoff model to predict the next word. The model will first look to trigrams to predict the next word; if only one word has been typed, or no matching trigram is available, the model will search the bigrams, and then fall back to unigrams.
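
A minimal sketch of that lookup, assuming trimmed tables trigrams_trim and bigrams_trim with columns prefix, next_word and n as sketched earlier (the function name and its defaults are illustrative):

predict_next <- function(input, trigrams_trim, bigrams_trim, unigrams_n, k = 3) {
        words <- unlist(strsplit(tolower(trimws(input)), "\\s+"))
        n_words <- length(words)

        # Try trigrams first: requires at least two words of context.
        if (n_words >= 2) {
                tri_prefix <- paste(words[n_words - 1], words[n_words])
                hits <- trigrams_trim %>% filter(prefix == tri_prefix)
                if (nrow(hits) > 0) return(head(hits$next_word, k))
        }

        # Back off to bigrams: one word of context.
        if (n_words >= 1) {
                hits <- bigrams_trim %>% filter(prefix == words[n_words])
                if (nrow(hits) > 0) return(head(hits$next_word, k))
        }

        # Fall back to the most frequent unigrams overall.
        head(unigrams_n$unigram, k)
}

# predict_next("looking forward to", trigrams_trim, bigrams_trim, unigrams_n)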

To build in a scoring system for prediction, we can use a conditional probability metric for the initial score, with a decay factor (lambda = 0.4) applied to (n-1)-gram scores. This helps the prediction algorithm compare high-probability (n-1)-gram candidates against n-gram candidates.
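
As a sketch, that scoring resembles the common “stupid backoff” formulation; the function name and the example counts below are illustrative, and the final implementation may differ:

# Observed relative frequency when the trigram exists; otherwise back off
# to the bigram score, discounted by lambda.
score_trigram <- function(c_trigram, c_bigram_prefix, bigram_score, lambda = 0.4) {
        if (c_trigram > 0) {
                c_trigram / c_bigram_prefix
        } else {
                lambda * bigram_score
        }
}

score_trigram(40, 4000, bigram_score = 0.05)   # observed trigram: 0.01
score_trigram(0, 4000, bigram_score = 0.05)    # unseen trigram: 0.4 * 0.05 = 0.02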

We will compare the above method to simpler add-one (Laplace) smoothing in terms of speed, efficiency and precision.
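
For reference, a sketch of what add-one (Laplace) smoothing looks like for trigrams, where V is the vocabulary size (the number of unique unigrams, roughly 55,485 in our sample); the function name is illustrative:

# Every continuation gets a pseudo-count of one, so unseen trigrams
# receive a small non-zero probability.
laplace_trigram <- function(c_trigram, c_bigram_prefix, V) {
        (c_trigram + 1) / (c_bigram_prefix + V)
}

laplace_trigram(40, 4000, V = 55485)   # observed trigram
laplace_trigram(0, 4000, V = 55485)    # unseen trigram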

Once the prediction algorithm is built, we can move forward with building a Shiny app to share our creation with the world.