Summary

The Data Science Specialization Capstone Project analyzes a large corpus of text documents, provided by SwiftKey, to discover the structure of the data and how words are put together. This report covers loading, exploring, and cleaning the text data, and building n-gram frequency tables. The next step will be a next-word prediction model, presented interactively through a Shiny app.

Analysis

Required packages:

library(caTools)
library(quanteda)
library(ggplot2)

Load the text files into R. The text files are in a “data” folder within the working directory.

blog_file <- readLines("data/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news_file <- readLines("data/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter_file <- readLines("data/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

Exploratory data analysis:

Total number of words in each file (lines are split on single spaces, so the counts are approximate):
sum(sapply(strsplit(blog_file, " "), length))
## [1] 37334131
sum(sapply(strsplit(news_file, " "), length))
## [1] 2643969
sum(sapply(strsplit(twitter_file, " "), length))
## [1] 30373583

The word counts show that the blog file has about 37.3 million words, the news file about 2.6 million, and the Twitter file about 30.4 million.
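Splitting on a single space counts an empty string for every extra space between words. The following minimal sketch uses invented lines to compare that approach with splitting on runs of whitespace. The vector `toy_lines` and both helper functions are illustrative assumptions, not part of the original analysis.

```r
# Toy input; these lines are invented for illustration only
toy_lines <- c("the quick  brown fox", "hello world", "one")

# Count as in the analysis above: split each line on a single space
count_single_space <- function(x) sum(sapply(strsplit(x, " "), length))

# Alternative: split on runs of whitespace so repeated spaces add no empty tokens
count_whitespace <- function(x) sum(lengths(strsplit(trimws(x), "[[:space:]]+")))

single_space_total <- count_single_space(toy_lines)
whitespace_total <- count_whitespace(toy_lines)
```

On these toy lines the two totals differ by one, because of the double space in the first line. On the full corpus the difference is likely small relative to the totals, but that is not verified here.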

Number of lines in each file:
length(blog_file)
## [1] 899288
length(news_file)
## [1] 77259
length(twitter_file)
## [1] 2360148

The blog file has 899,288 lines, the news file 77,259, and the Twitter file about 2.4 million.

Summary of the distribution of line lengths, in characters, within the three files:
summary(nchar(blog_file))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      47     156     230     329   40833
summary(nchar(news_file))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0   111.0   186.0   202.4   270.0  5760.0
summary(nchar(twitter_file))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   37.00   64.00   68.68  100.00  140.00

As the summary statistics show, line lengths range from 1 to 40,833 characters in the blog file, from 2 to 5,760 characters in the news file, and from 2 to 140 characters in the Twitter file.

Before we start cleaning the data, we subsample each text file to reduce computation. We draw 1% of the lines from each file, so the combined sample keeps the files' relative proportions. We then split the sample further, using 70% of the lines for training and 30% for testing.

set.seed(1234)

# Sample without replacement so that no line is drawn more than once
sub_blog_file <- sample(blog_file, size = round(length(blog_file) * .01), replace = FALSE)
sub_news_file <- sample(news_file, size = round(length(news_file) * .01), replace = FALSE)
sub_twitter_file <- sample(twitter_file, size = round(length(twitter_file) * .01), replace = FALSE)

raw_data <- c(sub_blog_file, sub_news_file, sub_twitter_file)

# Randomly flag about 70% of the sampled lines for training
inTrain <- sample.split(raw_data, SplitRatio = 0.7)
training <- raw_data[inTrain]
testing <- raw_data[!inTrain]
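The subsample-then-split idea can also be sketched in base R on a stand-in corpus. This is a minimal illustration under stated assumptions: `toy_corpus` is invented, the sampling rate is 10% rather than 1% so the toy sizes stay readable, lines are drawn without replacement, and plain index sampling takes the place of `sample.split()`.

```r
set.seed(1234)

# Stand-in corpus; the real analysis uses the blog, news, and Twitter lines
toy_corpus <- sprintf("line %d", 1:1000)

# Draw 10% of the lines without replacement (the analysis itself uses 1%)
sample_size <- round(length(toy_corpus) * 0.10)
toy_sample <- sample(toy_corpus, size = sample_size, replace = FALSE)

# Randomly mark 70% of the sampled lines for training
n_train <- round(length(toy_sample) * 0.7)
in_train <- seq_along(toy_sample) %in% sample(seq_along(toy_sample), n_train)

toy_training <- toy_sample[in_train]
toy_testing <- toy_sample[!in_train]
```

Because each line is drawn at most once and every line of the toy corpus is distinct, no line can appear in both the training and the testing set.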

With the raw training lines in hand, we run them through the text pre-processing pipeline.

Tokenize the text, remove numbers, punctuation, and symbols, and split hyphenated words:
# Note: quanteda 2.0 and later rename remove_hyphens to split_hyphens
train.tokens <- tokens(training, what = "word",
                       remove_numbers = TRUE,
                       remove_punct = TRUE,
                       remove_symbols = TRUE,
                       remove_hyphens = TRUE)

Lower case the tokens:
train.tokens <- tokens_tolower(train.tokens)

Remove stopwords. We will use quanteda’s built-in stopwords list for English:
train.tokens <- tokens_select(train.tokens, stopwords(), selection = "remove")

Perform “stemming”, that is, collapse words that share a stem into a single term, again for English:
train.tokens <- tokens_wordstem(train.tokens, language = "english")
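To make the effect of each cleaning step concrete, here is a rough base-R sketch of the same sequence applied to one invented sentence. It is not quanteda's implementation: `clean_tokens()`, `toy_text`, and the three-word `toy_stopwords` list are hypothetical stand-ins, and stemming is left out because base R has no stemmer.

```r
# Invented example line and a tiny stand-in stopword list
toy_text <- "The 2 quick-brown foxes are RUNNING, and the dog sleeps!"
toy_stopwords <- c("the", "are", "and")

clean_tokens <- function(text, stop_list) {
  text <- tolower(text)                        # lower case
  text <- gsub("-", " ", text, fixed = TRUE)   # split hyphenated words
  text <- gsub("[[:digit:]]+", " ", text)      # drop numbers
  text <- gsub("[[:punct:]]+", " ", text)      # drop punctuation and ASCII symbols
  tokens <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  tokens[!tokens %in% stop_list]               # drop stopwords
}

toy_tokens <- clean_tokens(toy_text, toy_stopwords)
```

The result keeps only the content words of the sentence, in their original order.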

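Next, we build n-grams to increase the predictive power of the model. For each n-gram size we store a document-feature matrix and a data frame of the corresponding document frequencies (the number of lines in which each n-gram appears), sorted in decreasing order, for easier processing and visualization. Finally, we save each data frame with saveRDS() to reduce load time and memory use. Despite the .RData extension, these files are in RDS format and must be read back with readRDS() rather than load().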
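As an illustration of what an n-gram document-frequency table contains, the sketch below builds one in base R from three invented token lists. The names `toy_docs`, `make_ngrams()`, and `ngram_doc_freq()` are hypothetical helpers for this example only; the analysis itself relies on quanteda.

```r
# Invented token lists, one per document, standing in for the training tokens
toy_docs <- list(
  c("new", "york", "city"),
  c("new", "york", "new", "york"),
  c("city", "hall")
)

# Build all n-grams of size n from one token vector
make_ngrams <- function(tokens, n, concatenator = " ") {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = concatenator),
         character(1))
}

# Document frequency: count each n-gram at most once per document
ngram_doc_freq <- function(docs, n) {
  per_doc <- lapply(docs, function(d) unique(make_ngrams(d, n)))
  counts <- sort(table(unlist(per_doc)), decreasing = TRUE)
  data.frame(token = names(counts), frequency = as.integer(counts),
             stringsAsFactors = FALSE)
}

toy_bigrams <- ngram_doc_freq(toy_docs, 2)
```

Here "new york" occurs three times in total but in only two documents, so its document frequency is 2. A model that needs raw occurrence counts would have to count every occurrence instead.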

Unigrams:
# docfreq() counts the lines in which each token appears
unigram <- dfm(train.tokens)
unigram_df <- sort(docfreq(unigram), decreasing = TRUE)
unigram_df <- data.frame(token = names(unigram_df), frequency = unigram_df)

saveRDS(unigram_df, "unigram_df.RData")
rm(unigram, unigram_df)

Bigrams:
# tokens_ngrams() joins each run of n consecutive tokens with a space
bigrams <- dfm(tokens_ngrams(train.tokens, n = 2, concatenator = " "))
bigrams_df <- sort(docfreq(bigrams), decreasing = TRUE)
bigrams_df <- data.frame(token = names(bigrams_df), frequency = bigrams_df)

saveRDS(bigrams_df, "bigrams_df.RData")
rm(bigrams, bigrams_df)

Trigrams:
trigrams <- dfm(tokens_ngrams(train.tokens, n = 3, concatenator = " "))
trigrams_df <- sort(docfreq(trigrams), decreasing = TRUE)
trigrams_df <- data.frame(token = names(trigrams_df), frequency = trigrams_df)

saveRDS(trigrams_df, "trigrams_df.RData")
rm(trigrams, trigrams_df)

Fourgrams:
fourgrams <- dfm(tokens_ngrams(train.tokens, n = 4, concatenator = " "))
fourgrams_df <- sort(docfreq(fourgrams), decreasing = TRUE)
fourgrams_df <- data.frame(token = names(fourgrams_df), frequency = fourgrams_df)

saveRDS(fourgrams_df, "fourgrams_df.RData")
rm(fourgrams, fourgrams_df)

Fivegrams:
fivegrams <- dfm(tokens_ngrams(train.tokens, n = 5, concatenator = " "))
fivegrams_df <- sort(docfreq(fivegrams), decreasing = TRUE)
fivegrams_df <- data.frame(token = names(fivegrams_df), frequency = fivegrams_df)

saveRDS(fivegrams_df, "fivegrams_df.RData")
rm(fivegrams, fivegrams_df)
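Files written with saveRDS() hold a single R object and are read back with readRDS(). The round trip can be sketched as follows, using an invented data frame and a temporary file; `toy_df` and `rds_path` are illustrative names, not files from the analysis.

```r
# Invented frequency table standing in for one of the n-gram data frames
toy_df <- data.frame(token = c("new york", "city hall"),
                     frequency = c(2L, 1L),
                     stringsAsFactors = FALSE)

# Write to a temporary file so the sketch leaves nothing behind
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy_df, rds_path)

# saveRDS() output is read back with readRDS(), not load()
restored_df <- readRDS(rds_path)
unlink(rds_path)
```

The restored object is assigned to whatever name the caller chooses, which is convenient when a later script or app loads several n-gram tables.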

Conclusion

Based on the n-gram data sets, the next step is a next-word prediction model, presented interactively through a Shiny app. In addition, a presentation will “pitch” the application, explaining its functionality and how to use it.
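One simple way such a model could use the saved tables is a direct lookup: given the previous word, return the most frequent continuation in the bigram table. The sketch below is only an illustration of that idea on invented data, not the planned model. `toy_bigram_table` and `predict_next()` are hypothetical, and a real input word would first need the same cleaning as the training text.

```r
# Invented bigram table in the same token/frequency layout as the saved tables
toy_bigram_table <- data.frame(
  token = c("new york", "new year", "york city", "new car"),
  frequency = c(5L, 3L, 2L, 1L),
  stringsAsFactors = FALSE
)

# Return the most frequent word that follows previous_word, or NA if unseen
predict_next <- function(previous_word, bigram_table) {
  prefix <- paste0(previous_word, " ")
  hits <- bigram_table[startsWith(bigram_table$token, prefix), ]
  if (nrow(hits) == 0) return(NA_character_)
  best <- hits$token[which.max(hits$frequency)]
  substring(best, nchar(prefix) + 1)
}

next_after_new <- predict_next("new", toy_bigram_table)
next_after_unseen <- predict_next("zebra", toy_bigram_table)
```

A fuller model would typically try the longest available n-gram first and back off to shorter ones when no match is found.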