Based on the text dataset provided by SwiftKey (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip), this milestone report delivers solutions to the tasks described below.
This milestone report deals exclusively with the English corpus; however, the techniques implemented here will also work for the German, Finnish, and Russian corpora.
The purpose of this Milestone Report is to demonstrate the ability to mine and analyze text data to discover interesting patterns, extract useful knowledge, and support prediction. To start this process, you need to get your hands on the data!
## Datafile from coursera website
file <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
## If file does not exist, download it and unzip it
if (!file.exists("Coursera-SwiftKey.zip")) {
download.file(file, destfile="Coursera-SwiftKey.zip", method = "curl")
}
unzip("Coursera-SwiftKey.zip")
The next step in the process is to load the data. The three datasets (blogs, news, tweets) are each sampled at 33,333 lines per file, which should enable reasonably representative early exploratory analysis. Prior to analysis or modeling, the data require cleaning (removing punctuation, numbers, stopwords, profanity, etc.) and tokenization (separating strings into individual words and N-grams).
rm(list = ls())
## Load required packages
library(dplyr)        # mutate, sample_n, count, filter
library(tidyr)        # separate
library(stringr)      # str_detect
library(stringi)      # stri_detect_regex
library(tidytext)     # unnest_tokens, stop_words
library(ggplot2)      # bar charts
library(wordcloud)    # word clouds
library(RColorBrewer) # brewer.pal colour palettes
library(pryr)         # object_size
## Read in data
blogs <- readLines("../Two/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE) %>%
  as_tibble() %>%
  mutate(num_char = nchar(value))
news <- readLines("../Two/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE) %>%
  as_tibble() %>%
  mutate(num_char = nchar(value))
tweets <- readLines("../Two/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE) %>%
  as_tibble() %>%
  mutate(num_char = nchar(value))
## Sample 33,333 lines from each dataset (seed set so the sample is reproducible)
set.seed(1234)
blogs_samp <- sample_n(blogs, 33333)
news_samp <- sample_n(news, 33333)
tweets_samp <- sample_n(tweets, 33333)
textmine <- rbind(blogs_samp, news_samp, tweets_samp)
## Datafile from Carnegie Mellon University School of Computer Science website
file <- "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
## If file does not exist, download
if (!file.exists("../Two/final/en_US/bad-words.txt")) {
download.file(file, destfile="../Two/final/en_US/bad-words.txt", method = "auto")
}
## Remove bad words
curses <- readLines("../Two/final/en_US/bad-words.txt")
curses <- curses[-1]  # drop the empty first row
curses <- data.frame(word = curses, lexicon = "PROFANE")  # same column layout as tidytext's stop_words
## Tokenize and filter N-gram data
data("stop_words")
colnames(textmine)[1] <- "text"  # rename the sampled text column for unnest_tokens()
unigrams <- textmine %>%
  unnest_tokens(unigram, text, token = "ngrams", n = 1) %>%  # one-word tokens
  count(unigram, sort = TRUE) %>%                            # frequency of each token
  separate(unigram, c("word1"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,                 # drop stopwords
         !word1 %in% curses$word,                     # drop profanity
         !str_detect(word1, "^\\d+"),                 # drop tokens starting with digits
         !str_detect(word1, "[:digit:]"),             # drop tokens containing digits
         !stri_detect_regex(word1, "^[:punct:]")) %>% # drop tokens starting with punctuation
  mutate(total = sum(n))                              # total token count after filtering
The above code was also applied to bi-, tri-, and quadgrams but is omitted/hidden here for readability.
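For reference, a minimal sketch of the bigram version is shown below; it assumes the same textmine, stop_words, and curses objects defined above, and the tri- and quadgram versions follow the same pattern.
bigrams <- textmine %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%  # two-word tokens
  count(bigram, sort = TRUE) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word,  # drop stopwords
         !word1 %in% curses$word,     !word2 %in% curses$word,      # drop profanity
         !str_detect(word1, "[:digit:]"), !str_detect(word2, "[:digit:]")) %>%  # drop digits
  mutate(total = sum(n))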
The first characterization of the data is a table describing the size (memory) and number of lines of each dataset (blogs, news, tweets, and textmine, the sampled data). Next, bar charts of the uni-, bi-, tri-, and quadgrams obtained from the sampled data are displayed.
## Tabulate data
Name <- c("blogs","news","tweets", "textmine")
Size_mb <- c( object_size(blogs), object_size(news), object_size(tweets), object_size(textmine))/1000000
Lines <- c( dim(blogs)[1], dim(news)[1], dim(tweets)[1], dim(textmine)[1])
rawtext_table <- data.frame(Name,Size_mb,Lines)
knitr::kable(rawtext_table)
Name | Size_mb | Lines |
---|---|---|
blogs | 264.1623 | 899288 |
news | 20.4213 | 77259 |
tweets | 325.4791 | 2360148 |
textmine | 23.1800 | 99999 |
## Visualize data
unigram <- within(unigrams, rm(total))  # drop the running total column
unigram %>%
  top_n(20, n) %>%                      # keep the 20 most frequent words
  mutate(word = reorder(word1, n)) %>%  # order bars by count
  ggplot(aes(word, n)) +
  geom_bar(stat = "identity", fill = "red", colour = "black") +
  xlab(NULL) +
  coord_flip() +
  ggtitle("Most frequent words in textmine corpus")
Again, the above code was applied to bi-, tri-, and quadgrams but is omitted/hidden here for readability.
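A sketch of the corresponding bigram chart, assuming a bigrams table built as outlined earlier (columns word1, word2, n, total):
bigram <- within(bigrams, rm(total))  # drop the running total column
bigram %>%
  top_n(20, n) %>%                    # keep the 20 most frequent bigrams
  mutate(pair = reorder(paste(word1, word2), n)) %>%
  ggplot(aes(pair, n)) +
  geom_bar(stat = "identity", fill = "red", colour = "black") +
  xlab(NULL) +
  coord_flip() +
  ggtitle("Most frequent bigrams in textmine corpus")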
visualizeWordcloud <- function(term, freq, title = "", min.freq = 50, max.words = 200){
  mypal <- brewer.pal(8, "Dark2")  # colour palette
  wordcloud(words = term,
            freq = freq,
            colors = mypal,
            scale = c(4, .1),
            rot.per = .15,
            min.freq = min.freq, max.words = max.words,
            random.order = FALSE)
}
#par(mfrow = c(1, 2))
visualizeWordcloud(term = unigram$word1, freq = unigram$n)
#visualizeWordcloud(term = paste(bigram$word1, bigram$word2), freq = bigram$n)
#par(mfrow = c(1, 2))
#visualizeWordcloud(term = paste(trigram$word1, trigram$word2, trigram$word3), freq = trigram$n)
#visualizeWordcloud(term = paste(quadgram$word1, quadgram$word2, quadgram$word3, quadgram$word4), freq = quadgram$n)
Finally, a wordcloud of the most frequent unigrams provides a more intuitive view of the same information.
The code and analysis above are the beginning of building a text-prediction App based on N-grams (currently 1, 2, 3, or 4 words). The model will work as follows: the App checks whether the inputted text matches a known N-gram (i.e., one previously learned from the textmine corpus) and then predicts the most appropriate (most frequent) next word, as sketched below.
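A minimal sketch of this lookup-with-backoff idea is shown below (an illustration only, not the final App). The function name predict_next is hypothetical, and the assumption that the bi-, tri-, and quadgram tables have columns word1…wordN plus a count n follows from the tokenization pipeline above.
predict_next <- function(input, unigrams, bigrams, trigrams, quadgrams) {
  ## Split the input into lower-case words and keep (up to) the last three
  words <- tail(unlist(strsplit(tolower(input), "\\s+")), 3)
  k <- length(words)
  ## Back off from the longest known N-gram to the shortest
  if (k >= 3) {
    hit <- quadgrams %>%
      filter(word1 == words[k - 2], word2 == words[k - 1], word3 == words[k]) %>%
      arrange(desc(n))
    if (nrow(hit) > 0) return(hit$word4[1])
  }
  if (k >= 2) {
    hit <- trigrams %>%
      filter(word1 == words[k - 1], word2 == words[k]) %>%
      arrange(desc(n))
    if (nrow(hit) > 0) return(hit$word3[1])
  }
  if (k >= 1) {
    hit <- bigrams %>%
      filter(word1 == words[k]) %>%
      arrange(desc(n))
    if (nrow(hit) > 0) return(hit$word2[1])
  }
  ## Fall back to the single most frequent unigram
  unigrams$word1[which.max(unigrams$n)]
}
For example, predict_next("thanks for the", unigrams, bigrams, trigrams, quadgrams) would return the word most frequently observed after "thanks for the" in the sampled corpus.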
The following points will also need to be addressed for implementation.
Finally, the above analysis is based upon the removal of stopwords… The App should incorporate these, as they are actually the most common linking words used in language.
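A minimal sketch of what that change would look like, assuming the textmine and curses objects from above: the stop_words filter is simply dropped from the tokenization pipeline so the common linking words remain in the N-gram tables.
unigrams_full <- textmine %>%
  unnest_tokens(unigram, text, token = "ngrams", n = 1) %>%
  count(unigram, sort = TRUE) %>%
  filter(!unigram %in% curses$word,          # still remove profanity
         !str_detect(unigram, "[:digit:]"))  # and tokens containing digits, but keep stopwords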