The goal of this project is to demonstrate the ability to work with and explore the data, as a first step toward building a prediction algorithm and a data product.
The motivation for this project is to:
1. Demonstrate that data have been successfully loaded.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings amassed so far.
4. Get feedback on plans for creating a prediction algorithm and Shiny app.
# Chosen packages, which may be of utility for the analysis:
library(dplyr); library(ggplot2); library(tibble); library(stringr); library(stringi); library(tm); library(SnowballC); library(RWeka);
library(RWekajars); library(quanteda)
Raw data, stored as text files, were downloaded to a local drive from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip (accessed 18 March 2022) and then read into R for further analysis.
The full data set includes blogs, tweets, and news articles written in English, Finnish, German, and Russian. For the purposes of this project, only the English text is analyzed.
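For reproducibility, a minimal sketch of how the archive could be fetched with base R is shown below; the download was actually performed manually, and the capstone/en_US/ folder layout used in the rest of this report reflects the local setup.
# Hypothetical download step (the archive was fetched manually for this report)
zipURL <- 'https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip'
if (!file.exists('Coursera-SwiftKey.zip')) {
  download.file(zipURL, destfile = 'Coursera-SwiftKey.zip', mode = 'wb')
  unzip('Coursera-SwiftKey.zip') # Extracted text files were then arranged under capstone/en_US/
}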
# Read each English-language source line by line (UTF-8, skipping embedded nuls)
blogs <- file('capstone/en_US/en_US.blogs.txt', 'r')
blogLines <- readLines(blogs, encoding = 'UTF-8', skipNul = TRUE, warn = FALSE)
close(blogs)
tweets <- file('capstone/en_US/en_US.twitter.txt', 'r')
tweetLines <- readLines(tweets, encoding = 'UTF-8', skipNul = TRUE, warn = FALSE)
close(tweets)
news <- file('capstone/en_US/en_US.news.txt', 'r')
newsLines <- readLines(news, encoding = 'UTF-8', skipNul = TRUE, warn = FALSE)
close(news)
For each source (blogs, tweets, news), the line count, word count, and character count are summarized in a simple table (file sizes are considered separately below).
# Summarize the three sources; named summaryTable to avoid masking base R's table()
summaryTable <- function(blogLines, tweetLines, newsLines) {
  data.frame(source = c('blogs', 'tweets', 'news'),
             lineCount = c(length(blogLines),
                           length(tweetLines),
                           length(newsLines)),
             words = c(sum(stri_count_words(blogLines)),
                       sum(stri_count_words(tweetLines)),
                       sum(stri_count_words(newsLines))),
             characters = c(stri_stats_general(blogLines)[3],
                            stri_stats_general(tweetLines)[3],
                            stri_stats_general(newsLines)[3]))
}
summaryTable(blogLines, tweetLines, newsLines)
## source lineCount words characters
## 1 blogs 899288 37546250 206824382
## 2 tweets 2360148 30093413 162096241
## 3 news 77259 2674536 15639408
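File sizes (in megabytes, MB) are not part of the table above; a small sketch of how they could be obtained with base R's file.size(), assuming the same local paths used earlier:
# Approximate on-disk sizes in MB, assuming the file paths used above
round(file.size(c('capstone/en_US/en_US.blogs.txt',
                  'capstone/en_US/en_US.twitter.txt',
                  'capstone/en_US/en_US.news.txt')) / 1024^2, 1)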
Given the sheer size of the data set, 20,000 lines of text from each of the three sources are sampled and then combined to form a corpus (a total of 60,000 lines of text).
set.seed(20220322)
sampling <- c(sample(blogLines, 20000), sample(tweetLines, 20000),
sample(newsLines, 20000))
length(sampling) # Should come up with 60000 as desired
## [1] 60000
stri_stats_general(sampling) # General statistics for the sample
## Lines LinesNEmpty Chars CharsNWhite
## 60000 60000 9968110 8264771
stri_stats_latex(sampling)
## CharsWord CharsCmdEnvir CharsWhite Words Cmds
## 7867964 24 2045495 1772188 7
## Envirs
## 0
This sample of the three sources yields just under 10 million characters and over 1.77 million words.
The sampled data are assembled into a corpus named textColl (short for text collection) and then cleaned; each cleaning step below carries a comment explaining its rationale.
textColl <- VCorpus(VectorSource(sampling))
spacing <- content_transformer(function(x, pattern) {
  gsub(pattern, " ", x) # Helper transformer: replace matches of a pattern with a space
})
removeURL <- function(u) {
  u <- stri_replace_all_regex(u, "(ht|f)tp\\S+\\s*", " ") # Drop http/https/ftp links
  stri_replace_all_regex(u, "www\\S+\\s*", " ") # Drop bare www links
}
removeHash <- function(h) {
  stri_replace_all_regex(h, "#\\S+", " ") # Drop Twitter hashtags
}
# URLs and hashtags are removed first, while '#', ':' and '/' are still present;
# running removePunctuation earlier would strip those characters and leave the patterns unmatched.
textColl <- tm_map(textColl, content_transformer(removeURL)) # To remove uniform resource locators, i.e. website addresses
textColl <- tm_map(textColl, content_transformer(removeHash)) # To remove Twitter hashtags
textColl <- tm_map(textColl, removePunctuation) # To remove punctuation
textColl <- tm_map(textColl, content_transformer(tolower)) # To convert all letters to lowercase for uniformity
textColl <- tm_map(textColl, stemDocument) # Stemming reduces, say, gerunds and past tenses to their stems, or root words (e.g. 'stemming' and 'stemmed' both become 'stem')
textColl <- tm_map(textColl, removeNumbers) # To remove numerals
textColl <- tm_map(textColl, removeWords, letters) # To remove any stray single letters - may take a moment
textColl <- tm_map(textColl, stripWhitespace) # To collapse any extraneous white space
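As a quick sanity check (not part of the original pipeline), one cleaned document can be printed to verify that the transformations took effect:
# Inspect the first cleaned document in the corpus
writeLines(as.character(textColl[[1]]))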
Part of the exercise is to build what are called n-grams to explore word-sequence frequencies. Specifically, these are unigrams (1-grams, single words), bigrams (2-grams, pairs of words), and trigrams (3-grams, strings of three words), created by 'tokenizing', i.e. extracting these n-gram patterns from the textColl corpus. An explanation can be found at http://en.wikipedia.org/wiki/N-gram.
# Note: Each tokenizing process may take a moment.
# For unigrams...
uniToken <- function(x) {
NGramTokenizer(x, Weka_control(min = 1, max = 1))
}
unigrams <- DocumentTermMatrix(textColl, control = list(tokenize = uniToken))
unigrams <- removeSparseTerms(unigrams, 0.999)
# For bigrams...
biToken <- function(x) {
NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
bigrams <- DocumentTermMatrix(textColl, control = list(tokenize = biToken))
bigrams <- removeSparseTerms(bigrams, 0.999)
# For trigrams...
triToken <- function(x) {
NGramTokenizer(x, Weka_control(min = 3, max = 3))
}
trigrams <- DocumentTermMatrix(textColl, control = list(tokenize = triToken))
trigrams <- removeSparseTerms(trigrams, 0.999)
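Given the empty-trigram issue described further below, a quick check of each matrix's dimensions is a useful guard (a sketch added here, not part of the original code):
# Number of documents (rows) and retained terms (columns) in each matrix
sapply(list(unigrams = unigrams, bigrams = bigrams, trigrams = trigrams), dim)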
# Will now measure n-grams' frequencies. Will also extract the top 25 from each word pattern.
freqs <- function(d) {
sort(colSums(as.matrix(d)), decreasing = T)
}
unifreqs <- freqs(unigrams)
unifreqs25 <- unifreqs[1:25] # Extracting the top 25
unifreqs25df <- data.frame(word = names(unifreqs25),
frequency = as.numeric(unifreqs25)
)
bifreqs <- freqs(bigrams)
bifreqs25 <- bifreqs[1:25] # Extracting the top 25
bifreqs25df <- data.frame(bigram = names(bifreqs25),
frequency = as.numeric(bifreqs25)
)
trifreqs <- freqs(trigrams)
trifreqs25 <- trifreqs[1:25] # Extracting the top 25
trifreqs25df <- data.frame(trigram = names(trifreqs25),
frequency = as.numeric(trifreqs25)
)
trifreqs50 <- trifreqs[26:50] # Extracting the next 25 trigram frequencies for curiosity
trifreqs50df <- data.frame(trigram50 = names(trifreqs50),
frequency = as.numeric(trifreqs50)
)
Extracting trigram frequencies was an exercise in trial and error. With stemming and stop-word removal applied, the unigram and bigram extractions ran seamlessly, but the trigram results came up empty. After some further attempts, I settled on leaving the stop words in, which produced the trigram frequencies I had hoped for.
From the literature I read, whether to stem or to remove stop words comes down to what works best for the task at hand. There are pros and cons to applying (or forgoing) stemming and stop-word removal. For example, stop words may be essential for phrase completeness, and stemming may unintentionally truncate a word that is already a stem. A case in point: note the phrases 'accord to the' and 'be abl to' in the trigram bar plot.
May re-explore ways to handle the stemming issue, because a cursory look at the top-25 trigrams, for example, reveals unintentional truncation of some words.
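One way to re-explore this, sketched here as a possible cross-check (quanteda is loaded above but has not been used so far), is to build trigrams from the raw sample without stemming and compare the phrasing:
# Unstemmed trigrams via quanteda, for comparison with the tm/RWeka pipeline
toks <- tokens(sampling, remove_punct = TRUE, remove_numbers = TRUE,
               remove_url = TRUE, remove_symbols = TRUE)
toks <- tokens_tolower(toks)
triDfm <- dfm(tokens_ngrams(toks, n = 3, concatenator = ' '))
topfeatures(triDfm, 25) # Top 25 unstemmed trigrams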
The quality of the trigrams, in terms of their syntax, looks promising. Will take the tokenizing process a bit further by increasing n-grams to four words.
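A sketch of that planned extension, following the same RWeka pattern used above (the object names are placeholders):
# For quadgrams (planned next step)...
quadToken <- function(x) {
  NGramTokenizer(x, Weka_control(min = 4, max = 4))
}
quadgrams <- DocumentTermMatrix(textColl, control = list(tokenize = quadToken))
quadgrams <- removeSparseTerms(quadgrams, 0.999)
quadfreqs <- freqs(quadgrams)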
Will then create a predictive algorithm based on the n-gram model.
From that predictive algorithm, will develop a data product (an app) that would predict words that are likely to follow a user’s manual entry.
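As a rough illustration of the direction (not the final algorithm), a simple back-off lookup over the frequency vectors built above might look like the following; predictNext is a hypothetical helper:
# Rough back-off sketch: try the last two words against the trigram table,
# then fall back to the last word against the bigram table.
# Note: trifreqs/bifreqs come from a stemmed, lowercased corpus, so user input
# may need the same preprocessing before lookup.
predictNext <- function(phrase, trifreqs, bifreqs) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- trifreqs[startsWith(names(trifreqs), paste0(prefix, " "))]
    if (length(hits) > 0) {
      return(sub(".*\\s", "", names(hits)[which.max(hits)])) # Last word of best trigram
    }
  }
  hits <- bifreqs[startsWith(names(bifreqs), paste0(words[n], " "))]
  if (length(hits) > 0) {
    return(sub(".*\\s", "", names(hits)[which.max(hits)])) # Last word of best bigram
  }
  NA_character_ # No match in the sampled tables
}
predictNext("thanks for", trifreqs, bifreqs) # Result depends on the sampled corpus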