Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain.
We will use all of the skills we have learned during the Data Science Specialization to tackle this problem: analysis of text data and natural language processing.
The data utilized is available from:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
This link contains data from three sources: blogs, news, and Twitter.
We must begin by downloading, sampling and cleaning the data.
Prepare the environment by loading the necessary packages.
library(readr); library(stringi); library(knitr); library(kableExtra); library(tidytext)
library(dplyr); library(ggplot2)
Download and unzip the data, if necessary, after checking whether the data directory already exists.
fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
fileName <- "data/Coursera-Swiftkey.zip"
if (!file.exists("data")) {
  dir.create("data")
  download.file(fileURL, destfile = fileName, method = "curl")
  dateDownloaded <- date()
  unzip(fileName, exdir = "data")
}
fileBlogs <- "data/final/en_US/en_US.blogs.txt"
fileNews <- "data/final/en_US/en_US.news.txt"
fileTwitter <- "data/final/en_US/en_US.twitter.txt"
Blogs_data <- read_lines(fileBlogs)
News_data <- read_lines(fileNews)
Twitter_data <- read_lines(fileTwitter)
After downloading the data, we explore its macro summary statistics. These statistics include: file size, number of lines per file, total characters per file, maximum characters per line, total words per file, and the maximum, minimum, and average words per line for each file.
file_size <- round(file.info(c(fileBlogs, fileNews, fileTwitter))$size / 1024 ^ 2, 2)
lines_file <- sapply(list(Blogs_data, News_data, Twitter_data), length)
char_file <- sapply(list(nchar(Blogs_data), nchar(News_data), nchar(Twitter_data)), sum)
char_line_max <- sapply(list(nchar(Blogs_data), nchar(News_data), nchar(Twitter_data)), max)
words_line <- lapply(list(Blogs_data, News_data, Twitter_data),
function(x) stri_count_words(x))
words_line_sum <- sapply(words_line, sum)
words_line_max <- sapply(words_line, max)
words_line_min <- sapply(words_line, min)
words_line_mean <- sapply(words_line, mean)
sum_stats <- data.frame(
file = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
Size_mb = file_size,
Lines = lines_file,
Total_Char = char_file,
char_max = char_line_max,
Total_words = words_line_sum,
Words_max = words_line_max,
words_min = words_line_min,
words_mean = words_line_mean
)
kable(sum_stats,
row.names = FALSE) %>%
kable_styling(position = "center")
| file | Size_mb | Lines | Total_Char | char_max | Total_words | Words_max | words_min | words_mean |
|---|---|---|---|---|---|---|---|---|
| en_US.blogs.txt | 200.42 | 899288 | 206824505 | 40833 | 37546239 | 6726 | 0 | 41.75107 |
| en_US.news.txt | 196.28 | 1010242 | 203223159 | 11384 | 34762395 | 1796 | 1 | 34.40997 |
| en_US.twitter.txt | 159.36 | 2360148 | 162096031 | 140 | 30093372 | 47 | 1 | 12.75063 |
Due to the large size of these text files and anticipated processing constraints, it is preferable to work with a sample of the overall data. We will draw a 1% sample from each source and combine them into a unified data set.
Additionally, we will write this to an .rds file for ease of use going forward.
set.seed(4242)
samp_rate <- 0.01
twit <- sample(Twitter_data, samp_rate * length(Twitter_data), replace = FALSE)
blog <- sample(Blogs_data, samp_rate * length(Blogs_data), replace = FALSE)
news <- sample(News_data, samp_rate * length(News_data), replace = FALSE)
samp <- c(twit, blog, news)
# saveRDS(samp, "data/sample.rds")
Below are the summary statistics for the combined sample data (1% from each data source).
samp_words <- stri_count_words(samp)
sample_stats <- data.frame(
  Lines = length(samp),
  Total_Char = sum(nchar(samp)),
  char_max = max(nchar(samp)),
  Total_words = sum(samp_words),
  Words_max = max(samp_words),
  words_min = min(samp_words),
  words_mean = mean(samp_words)
)
kable(sample_stats,
row.names = FALSE) %>%
kable_styling(position = "center")
| Lines | Total_Char | char_max | Total_words | Words_max | words_min | words_mean |
|---|---|---|---|---|---|---|
| 42695 | 5709616 | 4736 | 1022714 | 819 | 1 | 23.95395 |
Before we begin cleaning the data, we will remove some large objects to free up memory.
rm(Blogs_data, News_data, Twitter_data, twit, blog, news, words_line)
We will utilize the functionality of tidytext to clean and tokenize the data. The unnest_tokens function cleans the data by stripping white space, removing punctuation, and converting all words to lower case.
We have chosen to retain “stopwords”, since we are attempting to predict natural speech patterns, and removing them would result in nonsensical phrases.
We will create separate data frames for unigrams, bigrams and trigrams.
sample_df <- tibble(text = samp)
unigrams <- sample_df %>%
unnest_tokens(unigram, text)
bigrams <- sample_df %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
trigrams <- sample_df %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3)
A further cleaning step is to remove profanity and other words we do not want to predict.
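Below is a minimal sketch of how such a filter could be applied to the tokenized data, assuming a one-word-per-line profanity list has been saved locally as data/profanity.txt (a hypothetical path; the list linked in the references is one possible source).

```r
# Hypothetical path: a one-word-per-line profanity list saved beforehand.
profanity <- read_lines("data/profanity.txt")

# Drop unigrams that appear in the list.
unigrams <- unigrams %>%
  filter(!unigram %in% profanity)

# Drop any bigram or trigram containing a listed word.
bigrams <- bigrams %>%
  tidyr::separate(bigram, c("w1", "w2"), sep = " ", remove = FALSE) %>%
  filter(!w1 %in% profanity, !w2 %in% profanity) %>%
  select(bigram)

trigrams <- trigrams %>%
  tidyr::separate(trigram, c("w1", "w2", "w3"), sep = " ", remove = FALSE) %>%
  filter(!w1 %in% profanity, !w2 %in% profanity, !w3 %in% profanity) %>%
  select(trigram)
```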
The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships we observe in the data and to prepare to build our first linguistic models.
We will perform a thorough exploratory analysis of the data; understanding the distribution of words and the relationships between words in the corpora is important.
Here is a visualization of the most frequent words in the sample.
unigrams %>%
count(unigram, sort = TRUE) %>%
filter(n > 10000) %>%
mutate(unigram = reorder(unigram, n)) %>%
ggplot(aes(unigram, n, fill = unigram)) +
geom_col() +
coord_flip() +
theme(legend.position = "none")
Let’s also visualize the most popular bigrams and trigrams.
bigrams %>%
count(bigram, sort = TRUE) %>%
filter(n > 1000) %>%
mutate(bigram = reorder(bigram, n)) %>%
ggplot(aes(bigram, n, fill = bigram)) +
geom_col() +
coord_flip() +
theme(legend.position = "none")
trigrams %>%
count(trigram, sort = TRUE) %>%
filter(n > 150) %>%
mutate(trigram = reorder(trigram, n)) %>%
ggplot(aes(trigram, n, fill = trigram)) +
geom_col() +
coord_flip() +
theme(legend.position = "none")
We need to analyze the data sets to determine where we can trim the data for speed and efficiency.
unigrams_n <- unigrams %>%
count(unigram, sort = TRUE) %>%
mutate(cum_sum_pct = cumsum(n)/sum(n))
bigrams_n <- bigrams %>%
count(bigram, sort = TRUE) %>%
mutate(cum_sum_pct = cumsum(n)/sum(n))
trigrams_n <- trigrams %>%
count(trigram, sort = TRUE) %>%
mutate(cum_sum_pct = cumsum(n)/sum(n))
We will calculate the total number of unique n-grams and the fraction of them needed to cover 90% of all n-gram instances.
c(nrow(unigrams_n), nrow(bigrams_n), nrow(trigrams_n))
## [1] 55485 487831 874340
c(mean(unigrams_n$cum_sum_pct < 0.9), mean(bigrams_n$cum_sum_pct < 0.9), mean(trigrams_n$cum_sum_pct < 0.9))
## [1] 0.1366495 0.7903536 0.8830295
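As a small follow-up (not part of the original output), the proportions above can be converted into approximate counts of unique n-grams needed for 90% coverage:

```r
# Unique n-grams below the 90% coverage mark, plus the one that crosses it.
sapply(list(unigrams_n, bigrams_n, trigrams_n),
       function(x) sum(x$cum_sum_pct < 0.9) + 1)
```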
Here we see that the number of unique n-grams, and hence the size of the lookup tables, increases dramatically as n increases.
We will make two attempts to trim the data set to increase speed and efficiency; one possible approach is sketched below.
These implementations will be key to making the final product efficient while remaining precise.
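As a rough sketch of one such trimming approach (the 90% coverage threshold here is illustrative, not a final choice), we could keep only the n-grams that together account for most observed instances:

```r
# Illustrative trimming: retain n-grams up to roughly 90% cumulative coverage.
coverage <- 0.90
bigrams_trim  <- bigrams_n  %>% filter(cum_sum_pct <= coverage)
trigrams_trim <- trigrams_n %>% filter(cum_sum_pct <= coverage)
```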
We will implement a backoff model to predict the next word. The model will first look to the trigrams to predict the next word; if only one word has been typed, or no matching trigram is available, it will back off to the bigrams, and then to the unigrams.
To build a scoring system for prediction, we can use a conditional probability metric for initial scoring, with a decay factor (lambda = 0.4) applied to the lower-order (n-1)-grams. This should help our prediction algorithm compare high-probability (n-1)-grams against n-grams.
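Below is a minimal sketch of how such a backoff scorer could look, assuming the n-gram count tables built above (unigrams_n, bigrams_n, trigrams_n); the split columns, function name, and example words are illustrative choices rather than the final implementation.

```r
library(tidyr)

# Split the n-gram strings into their component words for lookup.
tri_split <- trigrams_n %>% separate(trigram, c("w1", "w2", "w3"), sep = " ")
bi_split  <- bigrams_n  %>% separate(bigram,  c("w1", "w2"),       sep = " ")

predict_next <- function(word1, word2, lambda = 0.4, top_n = 3) {
  # 1. Trigrams whose first two words match the last two typed words.
  tri_hits <- tri_split %>%
    filter(w1 == word1, w2 == word2) %>%
    mutate(score = n / sum(n)) %>%        # conditional probability estimate
    transmute(word = w3, score)
  if (nrow(tri_hits) > 0) return(head(tri_hits, top_n))

  # 2. Back off to bigrams starting with the last typed word,
  #    discounting the score by lambda.
  bi_hits <- bi_split %>%
    filter(w1 == word2) %>%
    mutate(score = lambda * n / sum(n)) %>%
    transmute(word = w2, score)
  if (nrow(bi_hits) > 0) return(head(bi_hits, top_n))

  # 3. Fall back to the most frequent unigrams, discounted twice.
  unigrams_n %>%
    mutate(score = lambda^2 * n / sum(n)) %>%
    transmute(word = unigram, score) %>%
    head(top_n)
}

# Example usage: predict_next("thanks", "for")
```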
We will compare the above method with simpler add-one (Laplace) smoothing in terms of speed, efficiency, and precision.
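For reference, add-one (Laplace) smoothing for a bigram model simply adds one to every count, with $V$ denoting the vocabulary size:

$$P_{\text{add-1}}(w_i \mid w_{i-1}) = \frac{\operatorname{count}(w_{i-1}, w_i) + 1}{\operatorname{count}(w_{i-1}) + V}$$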
Once we have created our prediction algorithm, we can move forward with building a Shiny app to share our creation with the world.
Wikipedia, Natural Language Processing: https://en.wikipedia.org/wiki/Natural_language_processing
Tidy Text Mining with R: https://www.tidytextmining.com/tidytext.html
Cornell, Smoothing & Backoff lecture slides: http://www.cs.cornell.edu/courses/cs4740/2014sp/lectures/smoothing+backoff.pdf
Profanity filtering word list: https://www.freewebheaders.com/bad-words-list-and-page-moderation-words-list-for-facebook/