The Data Science Specialization Capstone Project consists of analyzing a large corpus of text documents, provided by SwiftKey, in order to discover the structure of the data and how words are put together. This report covers loading, cleaning, and analyzing the text data. As a continuation of the present analysis, there will be a next-word-prediction model, presented interactively through a Shiny app.
Required packages:
library(caTools)
library(quanteda)
library(ggplot2)
Loading the text files into R. Note that the text files are placed in a “data” folder within the working directory.
blog_file <- readLines("data/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news_file <- readLines("data/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter_file <- readLines("data/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
sum(sapply(strsplit(blog_file, " "), length))
## [1] 37334131
sum(sapply(strsplit(news_file, " "), length))
## [1] 2643969
sum(sapply(strsplit(twitter_file, " "), length))
## [1] 30373583
As the word counts show, the blog element has about 37.3 million words, the news element about 2.6 million, and the twitter element about 30.4 million. The counts come from splitting each line on single spaces, so they are approximate.
length(blog_file)
## [1] 899288
length(news_file)
## [1] 77259
length(twitter_file)
## [1] 2360148
The blog element has a total of 899,288 lines, the news element 77,259 lines, and the twitter element 2,360,148 lines.
summary(nchar(blog_file))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 47 156 230 329 40833
summary(nchar(news_file))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 111.0 186.0 202.4 270.0 5760.0
summary(nchar(twitter_file))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 37.00 64.00 68.68 100.00 140.00
As the summary statistics show, line length varies from 1 to 40,833 characters in the blog element, from 2 to 5,760 characters in the news element, and from 2 to 140 characters in the twitter element.
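The three sets of statistics above (word counts, line counts, and line lengths) can also be collected in a single table. The following is a minimal sketch on a small made-up list of character vectors; `corpus_summary` and `toy` are hypothetical names that are not part of the original analysis, and words are counted here by splitting on runs of whitespace rather than on single spaces.

```r
# Hypothetical helper: one row of summary statistics per text source.
corpus_summary <- function(texts) {
  # texts: a named list of character vectors, one element per source
  rows <- lapply(texts, function(x) {
    data.frame(
      lines     = length(x),
      words     = sum(lengths(strsplit(trimws(x), "\\s+"))),
      min_chars = min(nchar(x)),
      max_chars = max(nchar(x))
    )
  })
  out <- do.call(rbind, rows)
  data.frame(source = names(texts), out, row.names = NULL)
}

# Toy data standing in for blog_file, news_file and twitter_file
toy <- list(
  blog    = c("a short blog line", "another one"),
  twitter = c("hello world")
)
stats <- corpus_summary(toy)
print(stats)
```

With the real data, the call would be `corpus_summary(list(blog = blog_file, news = news_file, twitter = twitter_file))`.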
Before we start cleaning the data, we subsample each text element to reduce the computational load. From each element we draw 1% of the lines, so the sample keeps the proportions of the three sources. We then slice the sample further, using 70% of the lines for training and 30% for testing.
set.seed(1234)
# Draw 1% of each file without replacement, so that no line is sampled twice
sub_blog_file <- sample(blog_file, size = round(length(blog_file) * .01, 0), replace = FALSE)
sub_news_file <- sample(news_file, size = round(length(news_file) * .01, 0), replace = FALSE)
sub_twitter_file <- sample(twitter_file, size = round(length(twitter_file) * .01, 0), replace = FALSE)
raw_data <- c(sub_blog_file, sub_news_file, sub_twitter_file)
inTrain <- sample.split(raw_data, 0.7)
training <- subset(raw_data, inTrain == TRUE)
testing <- subset(raw_data, inTrain == FALSE)
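The sampling and slicing steps can be tried out on a small toy vector first. The sketch below uses only base R rather than `sample.split`; `toy_lines`, `subsample`, `toy_train`, and `toy_test` are illustrative names, and the subsample is drawn without replacement (an assumption, chosen so that no line can land in both the training and the testing set).

```r
set.seed(1234)

# Toy stand-in for one of the text files: 1,000 distinct lines
toy_lines <- paste("line", seq_len(1000))

# Hypothetical helper: draw a fraction of the lines without replacement
subsample <- function(x, fraction) {
  sample(x, size = round(length(x) * fraction), replace = FALSE)
}

sub_lines <- subsample(toy_lines, 0.10)   # 100 lines

# 70/30 split by drawing the training indices at random
n_train   <- round(length(sub_lines) * 0.7)
train_idx <- sample(seq_along(sub_lines), size = n_train)
toy_train <- sub_lines[train_idx]
toy_test  <- sub_lines[-train_idx]

length(toy_train)  # 70
length(toy_test)   # 30
```

Because the indices are drawn once and the complement is used for testing, the two sets are disjoint by construction.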
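With the raw training lines in hand, we run them through the text pre-processing pipeline: we tokenize the lines (dropping numbers, punctuation, symbols, and hyphens), convert the tokens to lower case, and remove English stop words.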
# Note: quanteda 2.0 and later call the remove_hyphens argument split_hyphens
train.tokens <- tokens(training, what = "word",
                       remove_numbers = TRUE,
                       remove_punct = TRUE,
                       remove_symbols = TRUE,
                       remove_hyphens = TRUE)
train.tokens <- tokens_tolower(train.tokens)
train.tokens <- tokens_select(train.tokens, stopwords(), selection = "remove")
Next, we stem the tokens. In other words, we combine words that are similar and collapse them into a single term, again for English:
train.tokens <- tokens_wordstem(train.tokens, language = "english")
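To see what the pipeline does to individual lines, it can be run on a couple of made-up sentences. This is only a sketch: `toy_text` and `toy_tokens` are hypothetical names, `tokens_remove()` is used as a shorthand for `tokens_select(..., selection = "remove")`, and the hyphen option is left out because its argument name depends on the installed quanteda version.

```r
library(quanteda)

# Two made-up sentences standing in for the training lines
toy_text <- c("The cats are running in the garden!",
              "A cat ran 3 times; the dogs were running too.")

# Tokenize, dropping numbers, punctuation and symbols
toy_tokens <- tokens(toy_text, what = "word",
                     remove_numbers = TRUE,
                     remove_punct = TRUE,
                     remove_symbols = TRUE)

# Lower-case, remove English stop words, then stem
toy_tokens <- tokens_tolower(toy_tokens)
toy_tokens <- tokens_remove(toy_tokens, stopwords("en"))
toy_tokens <- tokens_wordstem(toy_tokens, language = "english")

# Inspect the remaining stems for each toy document
print(as.list(toy_tokens))
```

After these steps, inflected forms such as "cats" and "cat" share one stem, and stop words such as "the" no longer appear.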
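Next, let’s build n-grams, from unigrams up to five-grams, in order to increase the predictive power of our model. We store them as document-feature matrices, and as data frames with the corresponding frequencies (here docfreq(), the number of lines in which each n-gram occurs) for easier processing and visualization. Last, we save each data frame with saveRDS() to increase performance and save memory. Despite the .RData extension, the files are in RDS format and must be read back with readRDS().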
# Build the n-grams explicitly with tokens_ngrams(), then count them in a dfm
unigram <- dfm(tokens_ngrams(train.tokens, n = 1, concatenator = " "))
unigram_df <- as.data.frame(as.matrix(docfreq(unigram)))
unigram_df <- sort(rowSums(unigram_df), decreasing = TRUE)
unigram_df <- data.frame(token = names(unigram_df), frequency = unigram_df)
saveRDS(unigram_df,"unigram_df.RData")
rm(unigram, unigram_df)
bigrams <- dfm(tokens_ngrams(train.tokens, n = 2, concatenator = " "))
bigrams_df <- as.data.frame(as.matrix(docfreq(bigrams)))
bigrams_df <- sort(rowSums(bigrams_df), decreasing = TRUE)
bigrams_df <- data.frame(token = names(bigrams_df), frequency = bigrams_df)
saveRDS(bigrams_df,"bigrams_df.RData")
rm(bigrams, bigrams_df)
trigrams <- dfm(tokens_ngrams(train.tokens, n = 3, concatenator = " "))
trigrams_df <- as.data.frame(as.matrix(docfreq(trigrams)))
trigrams_df <- sort(rowSums(trigrams_df), decreasing = TRUE)
trigrams_df <- data.frame(token = names(trigrams_df), frequency = trigrams_df)
saveRDS(trigrams_df,"trigrams_df.RData")
rm(trigrams, trigrams_df)
fourgrams <- dfm(tokens_ngrams(train.tokens, n = 4, concatenator = " "))
fourgrams_df <- as.data.frame(as.matrix(docfreq(fourgrams)))
fourgrams_df <- sort(rowSums(fourgrams_df), decreasing = TRUE)
fourgrams_df <- data.frame(token = names(fourgrams_df), frequency = fourgrams_df)
saveRDS(fourgrams_df,"fourgrams_df.RData")
rm(fourgrams, fourgrams_df)
fivegrams <- dfm(tokens_ngrams(train.tokens, n = 5, concatenator = " "))
fivegrams_df <- as.data.frame(as.matrix(docfreq(fivegrams)))
fivegrams_df <- sort(rowSums(fivegrams_df), decreasing = TRUE)
fivegrams_df <- data.frame(token = names(fivegrams_df), frequency = fivegrams_df)
saveRDS(fivegrams_df,"fivegrams_df.RData")
rm(fivegrams, fivegrams_df)
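The five blocks above follow the same pattern, so they could be folded into one helper. The sketch below is an illustration under stated assumptions, not the original method: `ngram_table` is a hypothetical function, the toy tokens are made up, and frequencies are total occurrence counts taken with `colSums()` instead of the document frequencies returned by `docfreq()`. It also shows the round trip through `saveRDS()` and `readRDS()`, using a temporary file.

```r
library(quanteda)

# Hypothetical helper: frequency table for n-grams of a given size
ngram_table <- function(toks, n) {
  grams  <- tokens_ngrams(toks, n = n, concatenator = " ")
  # Total number of occurrences of each n-gram across all documents
  counts <- sort(colSums(dfm(grams)), decreasing = TRUE)
  data.frame(token = names(counts), frequency = unname(counts),
             row.names = NULL, stringsAsFactors = FALSE)
}

# Made-up tokens standing in for train.tokens
toy_tokens  <- tokens(c("i like green tea", "i like black tea"))
toy_bigrams <- ngram_table(toy_tokens, 2)
print(toy_bigrams)

# Save to a temporary file and read the table back
path <- tempfile(fileext = ".rds")
saveRDS(toy_bigrams, path)
restored <- readRDS(path)
```

With the real data, a call such as `ngram_table(train.tokens, 3)` would stand in for the trigram block; whether total counts or document frequencies suit the prediction model better is a design choice left open here.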
Based on the n-gram data sets, there will be a next-word-prediction model, presented interactively through a Shiny app. In addition, there will be a presentation “pitching” the application, explaining its functionality and how to use it.