This report documents the week two assignment of the Coursera Data Science Capstone project. In this project I will develop a Shiny app built around a predictive text model. The app should display the predicted next word based on text entered by the user. The data to train the prediction model is provided by SwiftKey and contains text from news articles, blogs, and tweets.
This report describes the exploratory data analysis performed on the input data (English texts only). The goal of this task is to understand the basic relationships in the data and to prepare first linguistic models. Questions to consider are:
1. Some words are more frequent than others - what are the distributions of word frequencies?
2. What are the frequencies of 2-grams and 3-grams in the dataset?
3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
4. How do you evaluate how many of the words come from foreign languages?
5. Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?
I will follow the principles of tidy data and use the tidyverse and tidytext packages for basic data manipulation. For further text mining tasks after this exploratory data analysis I might use the dedicated tm and quanteda packages, but for the work detailed below I do not need them.
The input data sets are huge: taken together they contain more than 3.3 million text lines, with up to 6726 words in a single line (see the summary below). I take a random sample of 25,000 observations per source to perform my analysis. I make sure to free memory by removing objects that are no longer needed and calling the garbage collector.
# read in all files
file_to_read <- file("data/en_US/en_US.blogs.txt", "r")
blogs <- readLines(file_to_read, encoding = "UTF-8", skipNul = TRUE)
close(file_to_read)
file_to_read <- file("data/en_US/en_US.news.txt", "r")
news <- readLines(file_to_read, encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines(file_to_read, encoding = "UTF-8", skipNul = TRUE):
## incomplete final line found on 'data/en_US/en_US.news.txt'
close(file_to_read)
file_to_read <- file("data/en_US/en_US.twitter.txt", "r")
twitter <- readLines(file_to_read, encoding = "UTF-8", skipNul = TRUE)
close(file_to_read)
# gather data into one data set
data_full <- rbind(
  tibble(source = "blogs", text = blogs) %>%
    mutate(line = row_number(), n_words = str_count(text, boundary("word"))),
  tibble(source = "news", text = news) %>%
    mutate(line = row_number(), n_words = str_count(text, boundary("word"))),
  tibble(source = "twitter", text = twitter) %>%
    mutate(line = row_number(), n_words = str_count(text, boundary("word")))
) %>% select(source, line, n_words, text)
data_full %>% group_by(source) %>%
  summarize(lines = max(line),
            n_words_min = min(n_words),
            n_words_mean = mean(n_words),
            n_words_max = max(n_words))
## # A tibble: 3 x 5
##   source    lines n_words_min n_words_mean n_words_max
##   <chr>     <dbl>       <dbl>        <dbl>       <dbl>
## 1 blogs    899288        0            41.8        6726
## 2 news      77259        1.00         34.6        1123
## 3 twitter 2360148        1.00         12.8          47.0
# sample the data to get 25k observations per source
data <- data_full %>% select(-n_words) %>%
  group_by(source) %>% sample_n(25000) %>% ungroup()
# free memory
memory.size()
## [1] 1174.9
rm(blogs)
rm(news)
rm(twitter)
rm(data_full)
gc()
##            used (Mb) gc trigger  (Mb)  max used  (Mb)
## Ncells  1825572 97.5    6619081 353.5   7015486 374.7
## Vcells  6771689 51.7   89246836 680.9 109937245 838.8
memory.size()
## [1] 569.6
# get a tidy word data set
data_uni <- data %>% unnest_tokens(word, text, token = "words")
To convert the text lines into single words or n-grams I use the unnest_tokens() function of the tidytext package, which converts all text to lower case and removes punctuation. Then I create two word clouds, one with all words and one with common (uninteresting) words removed. Common words can be removed with an anti-join against a collection of stop-words. These two word clouds show how dominant stop-words are and that further cleaning is indicated. The word cloud without stop-words shows numbers and terms that still contain punctuation (e.g. u.s, p.m).
I remove all numbers and remaining punctuation from the base text data set and build uni-, bi-, and trigrams. Then I plot the 30 most common n-grams. Using the unigrams, I find that the texts contain roughly 2232k word instances and 82k unique words in total. The most common 7848 unique words make up 90% of all word instances, which is only a fraction of 9.6% of all unique words. Similarly, only 152 unique words (that is 0.2% of all unique words) make up 50% of all word instances. These numbers would change dramatically if the stop-words were removed.
The language of the texts can be determined with the textcat package. The input needs to be a text sample; a single word is not enough. The language detection does not work very well, but overall it confirms that the texts are indeed written in English.
# remove stop-words (just to get a better grasp)
data_uni_red <- data_uni %>% anti_join(stop_words, by = "word")
# word clouds
set.seed(007)
data_uni %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
title("with stop-words")
data_uni_red %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
title("without stop-words")
# remove numbers and punctuation
# (str_replace_all replaces every occurrence, not just the first one)
data_clean <- data %>%
  mutate(text = str_replace_all(text, "[[:punct:]]", " ")) %>%
  mutate(text = str_replace_all(text, "[[:digit:]]", " ")) %>%
  mutate(text = str_replace_all(text, "\\s+", " "))
# get clean n-grams
data_uni <- data_clean %>%
  unnest_tokens(term, text, token = "words") %>%
  count(term) %>% arrange(desc(n)) %>%
  mutate(prop = n / sum(n))
data_bi <- data_clean %>%
  unnest_tokens(term, text, token = "ngrams", n = 2) %>%
  count(term) %>% arrange(desc(n)) %>%
  mutate(prop = n / sum(n))
data_tri <- data_clean %>%
  unnest_tokens(term, text, token = "ngrams", n = 3) %>%
  count(term) %>% arrange(desc(n)) %>%
  mutate(prop = n / sum(n))
# plot the most frequent n-grams
data_uni[1:30, ] %>% ggplot(aes(x = reorder(term, prop), y = prop)) +
  geom_bar(stat = "identity") + coord_flip() +
  labs(x = "Term", y = "Proportion of all terms", title = "30 most frequent unigrams")
data_bi[1:30, ] %>% ggplot(aes(x = reorder(term, prop), y = prop)) +
  geom_bar(stat = "identity") + coord_flip() +
  labs(x = "Term", y = "Proportion of all terms", title = "30 most frequent bigrams")
data_tri[1:30, ] %>% ggplot(aes(x = reorder(term, prop), y = prop)) +
  geom_bar(stat = "identity") + coord_flip() +
  labs(x = "Term", y = "Proportion of all terms", title = "30 most frequent trigrams")
# How many unique words are needed to cover 50% or 90% of all words?
counts <- tibble(
  key = "word instances: absolute count",
  n_tot = sum(data_uni$n),
  n_90 = round(sum(data_uni$n) * 0.9, 0),
  n_50 = round(sum(data_uni$n) * 0.5, 0)
)
counts <- add_row(
  counts,
  key = "unique words: absolute count",
  n_tot = data_uni %>% nrow(),
  n_90 = data_uni %>% mutate(prop_cum = cumsum(prop)) %>%
    filter(prop_cum <= 0.9) %>% nrow(),
  n_50 = data_uni %>% mutate(prop_cum = cumsum(prop)) %>%
    filter(prop_cum <= 0.5) %>% nrow()
)
(counts <- add_row(
  counts,
  key = "unique words: relative proportion",
  n_tot = 1,
  n_90 = counts$n_90[2] / counts$n_tot[2],
  n_50 = counts$n_50[2] / counts$n_tot[2]
))
## # A tibble: 3 x 4
##   key                                 n_tot    n_90    n_50
##   <chr>                               <dbl>   <dbl>   <dbl>
## 1 word instances: absolute count    2231640 2008476 1115820
## 2 unique words: absolute count        81615    7848     152
## 3 unique words: relative proportion     1.00  0.0962 0.00186
# language detection (time intensive and not very insightful)
# library(textcat)
# data %>% mutate(language = textcat(text)) %>%
# group_by(language) %>% summarize(n = n()) %>% arrange(desc(n))
The exploratory data analysis has shown that the provided input data is too large to be processed in full and needs to be sampled. The assumption that the input texts are indeed written in English holds true. As expected, most of the word instances in the texts consist of only a few very common words such as “the” and “and”, so-called stop-words. In many text mining applications, one chooses to remove those stop-words. But as the ultimate goal of this project is an app that displays the predicted next word based on a user's input text, I intentionally keep all stop-words!
Further analysis shows that the texts also contain numbers quite frequently and that there are some special cases where words contain punctuation. I elect to remove all numbers and superfluous punctuation, and I will add a hint in my app that the user input needs to be in English and that all numbers and punctuation will be removed from it.
The most frequent unigrams are: “the”, “do”, “and”. The most frequent bigrams are: “of the”, “in the”, “to the”. The most frequent trigrams are: “one of the”, “a lot of”, “as well as”.
The next steps will be to develop a prediction model based on the frequencies of the n-grams found and to design a Shiny app that takes user input and displays the predicted next word. The prediction model will work like this (a rough code sketch follows the list):
1. read in the user input and determine the number of words in it (i)
2. check the respective n-gram frequency table where n = i + 1 and return the last word of the most frequent term starting with the user input
3. if no match is found, remove the first word of the user input and match with the (n-1)-gram table (and so on)
4. if no exact match is found at all down to the 2-gram table, perform some sort of fuzzy string match in the 2-gram table
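To make steps 1 to 3 concrete, the lookup could be sketched roughly as follows. This is only a minimal illustration under assumptions, not the final implementation: the helper predict_next_word() and its input cleaning are hypothetical, it reuses the data_bi and data_tri tables built above (columns term, n, prop), and the fuzzy matching of step 4 is deliberately left out.
# rough sketch of the backoff lookup (steps 1-3); predict_next_word() is a
# hypothetical helper that reuses the n-gram tables built above
library(dplyr)
library(stringr)
predict_next_word <- function(input, ngram_tables = list(data_bi, data_tri)) {
  # clean the input the same way as the training text
  words <- input %>%
    str_to_lower() %>%
    str_replace_all("[[:punct:]]|[[:digit:]]", " ") %>%
    str_squish() %>%
    str_split(" ") %>%
    unlist()
  # start with the largest n-gram table and back off to smaller ones
  for (size in rev(seq_along(ngram_tables) + 1)) {  # size = 3, 2, ...
    if (length(words) < size - 1) next
    prefix <- str_c(tail(words, size - 1), collapse = " ")
    hit <- ngram_tables[[size - 1]] %>%
      filter(str_starts(term, fixed(str_c(prefix, " ")))) %>%
      arrange(desc(n)) %>%
      slice(1)
    # return the last word of the most frequent matching n-gram
    if (nrow(hit) == 1) return(word(hit$term, -1))
  }
  NA_character_  # no exact match; the fuzzy matching of step 4 would go here
}
# example call with a made-up input
# predict_next_word("thanks for the")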
The most critical point for the app will be its performance. I need to determine how many n-gram tables to include (e.g. up to n = 5?) and how much to truncate them (e.g. remove n-grams with a proportional frequency below a certain threshold).
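To judge how strongly such a threshold would shrink the tables, a quick comparison along the following lines could be run on the tables built above; the helper truncate_ngrams() and the threshold of 1e-6 are placeholders chosen for illustration only, not tuned values.
# drop n-grams whose proportional frequency is below a (placeholder) threshold
library(dplyr)
truncate_ngrams <- function(ngram_table, min_prop = 1e-6) {
  ngram_table %>% filter(prop >= min_prop)
}
# compare table sizes before and after truncation
tibble(
  table     = c("unigrams", "bigrams", "trigrams"),
  full      = c(nrow(data_uni), nrow(data_bi), nrow(data_tri)),
  truncated = c(nrow(truncate_ngrams(data_uni)),
                nrow(truncate_ngrams(data_bi)),
                nrow(truncate_ngrams(data_tri)))
)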