This report gives a first insight into the progress of the project to build a word prediction algorithm from the three English text files provided by SwiftKey.
After downloading the files, I read the English files into R to perform an initial analysis of the raw data. The questions I want to answer in this first look at the data are:
How large is each file?
How many lines does each file contain?
How many words does each file contain?
What are the most frequent words and word combinations (n-grams)?
By answering these questions we can gain a better perspective on the resources we have available to build a successful prediction model.
library(tm)
Loading required package: NLP
library(RWeka)
library(ggplot2)
Attaching package: 'ggplot2'
The following object is masked from 'package:NLP':
annotate
library(qdap)
Loading required package: qdapDictionaries
Loading required package: qdapRegex
Loading required package: qdapTools
Loading required package: RColorBrewer
Attaching package: 'qdap'
The following objects are masked from 'package:tm':
as.DocumentTermMatrix, as.TermDocumentMatrix
The following object is masked from 'package:base':
Filter
library(stringi)
# create a destination file for the download
swiftkey_zip <- "Coursera-SwiftKey.zip"
# download the compressed files from online source
source <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(source, swiftkey_zip)
# extract the files from the zip file
unzip(swiftkey_zip)
We will be working with the English text files first.
con1 <- file("~/Documents/Coursera/Data Science/Data Science Capstone/final/en_US/en_US.blogs.txt", "r")
en_US_blogs <- readLines(con1, encoding="UTF-8")
close(con1)
con2 <- file("~/Documents/Coursera/Data Science/Data Science Capstone/final/en_US/en_US.news.txt", "r")
en_US_news <- readLines(con2, encoding="UTF-8")
close(con2)
con3 <- file("~/Documents/Coursera/Data Science/Data Science Capstone/final/en_US/en_US.twitter.txt", "r")
en_US_twitter <- readLines(con3, encoding="UTF-8")
close(con3)
Now we can compute the answers to the first three questions.
num_words_blogs <- stri_count_words(en_US_blogs)
num_words_news <- stri_count_words(en_US_news)
num_words_twitter <- stri_count_words(en_US_twitter)
file_size_blogs <- file.info("~/Documents/Coursera/Data Science/Data Science Capstone/final/en_US/en_US.blogs.txt")$size/1024^2
file_size_news <- file.info("~/Documents/Coursera/Data Science/Data Science Capstone/final/en_US/en_US.news.txt")$size/1024^2
file_size_twitter <- file.info("~/Documents/Coursera/Data Science/Data Science Capstone/final/en_US/en_US.twitter.txt")$size/1024^2
summary_table <- data.frame(filename = c("blogs","news","twitter"),
file_size_MB = c(file_size_blogs, file_size_news, file_size_twitter),
num_lines = c(length(en_US_blogs),length(en_US_news),length(en_US_twitter)),
num_words = c(sum(num_words_blogs),sum(num_words_news),sum(num_words_twitter)))
summary_table
  filename file_size_MB num_lines num_words
1    blogs     200.4242    899288  37541795
2     news     196.2775   1010242  34762303
3  twitter     159.3641   2360148  30092866
As part of cleaning the data, I will create a smaller training sample from each file, which makes applying the different tools for cleaning and analysing the data much easier and faster. From statistical inference we know that a relatively small random sample is sufficient to draw conclusions about the full data set.
set.seed(1234)
blogs_train <- sample(en_US_blogs, length(en_US_blogs)*0.01)
news_train <- sample(en_US_news, length(en_US_news)*0.01)
twitter_train <- sample(en_US_twitter, length(en_US_twitter)*0.01)
I combine these samples into one single training set for further processing.
train <- c(blogs_train, news_train, twitter_train) # concatenate the three samples into one vector
The next step is the so-called ‘Tokenization’, which means transforming the text into identifiable units (tokens) such as words, punctuation and numbers. For this purpose I first split the text into single sentences and then create a function which removes numbers, extra whitespace and punctuation, and converts the whole text to lower case.
train <- sent_detect(train, language = "en", model = NULL) # subdivide into sentences
# Helper functions for tokenization and cleaning
repl_patt <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
preprocessCorpus <- function(corpus){
corpus <- tm_map(corpus, repl_patt, "/|@|\\|")
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
return(corpus)
}
# resulting text corpus
train.corpus <- VCorpus(VectorSource(train))
train.corpus <- preprocessCorpus(train.corpus)
I also want to remove profane words. I use a modified list of profane (“bad”) words for this filtering.
bad_words <- readLines("bad_words.txt") # removeWords expects a character vector of words
Warning in readLines("bad_words.txt"): incomplete final line found on
'bad_words.txt'
train.corpus <- tm_map(train.corpus, removeWords, bad_words)
To be able to make assertions about the quantitative properties of the data, I have to convert the now clean text into a data frame that allows counting words and word combinations (n-grams). For this project we will look at unigrams (single words), bigrams (combinations of two words) and trigrams (combinations of three words) and their frequencies.
# create a data frame from the clean data
training <- data.frame(text=unlist(sapply(train.corpus, `[`, "content")), stringsAsFactors=F)
# create n-grams
unigrams <- NGramTokenizer(training$text, Weka_control(min = 1, max = 1))
bigrams <- NGramTokenizer(training$text, Weka_control(min = 2, max = 2))
trigrams <- NGramTokenizer(training$text, Weka_control(min = 3, max = 3))
# create data frames from n-grams
uni <- data.frame(table(unigrams))
bi <- data.frame(table(bigrams))
tri <- data.frame(table(trigrams))
# sorting the resulting data frames
uni_sorted <- uni[order(uni$Freq,decreasing = TRUE),]
bi_sorted <- bi[order(bi$Freq,decreasing = TRUE),]
tri_sorted <- tri[order(tri$Freq,decreasing = TRUE),]
Now we can take a look at the most frequent words and word combinations. The following plots show the top 20 occurrences for each n-gram type.
# First plot for Unigrams
uni_top20 <- uni_sorted[1:20,]
colnames(uni_top20) <- c("Word","Frequency")
ggplot(uni_top20, aes(x=Word, y=Frequency)) + geom_bar(stat="identity", fill="darkblue") + coord_flip() + theme(axis.title.y = element_blank()) + geom_text(aes(label=Frequency), hjust=-0.2) + ggtitle("Top 20 Words")
# Second plot for Bigrams
bi_top20 <- bi_sorted[1:20,]
colnames(bi_top20) <- c("Word","Frequency")
ggplot(bi_top20, aes(x=Word, y=Frequency)) + geom_bar(stat="identity", fill="blue") + coord_flip() + theme(axis.title.y = element_blank()) + geom_text(aes(label=Frequency), hjust=-0.2) + ggtitle("Top 20 Bigrams")
# Third plot for Trigrams
tri_top20 <- tri_sorted[1:20,]
colnames(tri_top20) <- c("Word","Frequency")
ggplot(tri_top20, aes(x=Word, y=Frequency)) + geom_bar(stat="identity", fill="lightblue") + coord_flip() + theme(axis.title.y = element_blank()) + geom_text(aes(label=Frequency), hjust=-0.2) + ggtitle("Top 20 Trigrams")
For the prediction model and the subsequent application we can now take the n-grams we created and start testing different prediction algorithms based on next-word probabilities. The most promising results should come from the trigrams or a combination of n-grams, and it is also planned to implement smoothing methods, such as Kneser-Ney smoothing, to make the predictions more precise. A first, simplified sketch of such a back-off lookup is shown below.
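The following is only a minimal sketch of the planned direction, assuming the sorted n-gram tables created above (tri_sorted, bi_sorted, uni_sorted with their default columns trigrams/bigrams/unigrams and Freq). The function name predict_next_word is an illustrative placeholder, not the final model, and no smoothing is applied yet.
# Minimal back-off sketch: prefer the longest matching n-gram,
# fall back to shorter ones, and finally to the overall most frequent words.
predict_next_word <- function(input, n = 3) {
  words <- tolower(unlist(strsplit(input, "\\s+")))
  words <- tail(words, 2)
  # try trigrams first: match the last two words and return the third word
  if (length(words) == 2) {
    patt <- paste0("^", words[1], " ", words[2], " ")
    hits <- tri_sorted[grepl(patt, tri_sorted$trigrams), ]
    if (nrow(hits) > 0) {
      return(head(sapply(strsplit(as.character(hits$trigrams), " "), `[`, 3), n))
    }
  }
  # back off to bigrams: match the last word and return the second word
  patt <- paste0("^", tail(words, 1), " ")
  hits <- bi_sorted[grepl(patt, bi_sorted$bigrams), ]
  if (nrow(hits) > 0) {
    return(head(sapply(strsplit(as.character(hits$bigrams), " "), `[`, 2), n))
  }
  # final fallback: the overall most frequent single words
  head(as.character(uni_sorted$unigrams), n)
}
# example call with a hypothetical input phrase
predict_next_word("thanks for the")
The idea is simply to prefer the longest matching n-gram and to back off to shorter ones when no match is found; a smoothing method such as Kneser-Ney would later replace the raw frequency counts with discounted probabilities.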