In this report, we obtain the ‘SwiftKey’ data set and load a subset of the English dataset for further processing. The data is cleaned by applying basic filters from the package ‘tm’ and removing profanity. Finally, we build data frames of individual words, pairs, and triples of words and explore their distribution and frequency. We also provide a brief outline of the modeling plan at the end of the document.
First, we start by downloading the SwiftKey dataset using the URL provided on the course website and unzipping the file.
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if(!file.exists("Coursera-SwiftKey.zip")){
download.file(url, "Coursera-SwiftKey.zip", method = "curl")
}
if(!dir.exists("final/")){
unzip("Coursera-SwiftKey.zip")
}
Here we have a look at the English dataset, which will be used for the rest of the report.
dir("final/en_US/")
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
The following chunk of code gives a basic summary of the three files in the English dataset: the number of lines, the number of non-empty lines, the number of characters, and the number of non-whitespace characters.
require("stringi")
rbind("Blogs" = stri_stats_general(readLines("final/en_US/en_US.blogs.txt")),
"News" = stri_stats_general(readLines("final/en_US/en_US.news.txt")),
"Twitter" = stri_stats_general(readLines("final/en_US/en_US.twitter.txt")))
## Lines LinesNEmpty Chars CharsNWhite
## Blogs 899288 899288 206824382 170389539
## News 1010242 1010242 203223154 169860866
## Twitter 2360148 2360148 162096031 134082634
Second, for further processing, we read the first 5000 lines of each file as a representative subset of the data and save these subsets in the directory sub/.
if(!dir.exists("sub/")){
dir.create("sub/")
write(readLines("final/en_US/en_US.blogs.txt", 5000), "sub/blogs.txt")
write(readLines("final/en_US/en_US.news.txt", 5000), "sub/news.txt")
write(readLines("final/en_US/en_US.twitter.txt", 5000), "sub/twitter.txt")
}
dir("sub/")
## [1] "blogs.txt" "news.txt" "twitter.txt"
The first step in cleaning the data is applying basic filters from the package ‘tm’: removing extra white space, punctuation, numbers, and stop words. In addition, the function transforms all letters to lower case, stems the words, and converts the documents to the ‘PlainTextDocument’ class. The function is applied after transforming the data into a ‘Corpus’ and returns a corpus which will be used for the further analysis.
cleantext <- function(doc){
require(tm)
doc <- tm_map(doc, stripWhitespace) # remove white spaces
doc <- tm_map(doc, removePunctuation) # remove punctuation
doc <- tm_map(doc, removeNumbers) # remove numbers
doc <- tm_map(doc, tolower) # turn words to lower case
doc <- tm_map(doc, removeWords, stopwords("english")) # remove stop words
doc <- tm_map(doc, stemDocument) # stemming
doc <- tm_map(doc, PlainTextDocument)
return(doc)
}
corpus <- cleantext( Corpus(DirSource("sub/")) )
Here, we obtain a list of profane words and apply a filter on the corpus to remove the words that match.
url <- "http://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
if(!file.exists("badwords.txt")){
download.file(url, "badwords.txt")
}
corpus <- tm_map(corpus, removeWords,
readLines("badwords.txt"))
The first step in EDA is transforming the corpus into a data frame. Then we use ‘NGramTokenizer’ from the package ‘RWeka’ to create collections of individual words, pairs, and triples of words. We tabulate the n-grams with the frequency at which they appear in the text, then order the data frames in descending order using the ‘dplyr’ package.
corpus.df <- data.frame(text=unlist(sapply(corpus,'[',"content")),stringsAsFactors=F)
TokenizersDelimiters <- "\"\'\\t\\r\\n ().,;!?"
arrangbyfreq <- function(x){
require("dplyr")
arrange(data.frame(x), desc(Freq))
}
require(RWeka)
unigram <- arrangbyfreq(table(UniTokenizer = NGramTokenizer(corpus.df, Weka_control(min = 1, max = 1))))
bigram <- arrangbyfreq(table(BiTokenizer = NGramTokenizer(corpus.df, Weka_control(min = 2, max = 2, delimiters = TokenizersDelimiters))))
trigram <- arrangbyfreq(table(TriTokenizer = NGramTokenizer(corpus.df, Weka_control(min = 3, max = 3, delimiters = TokenizersDelimiters))))
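As a quick sanity check, the short sketch below (using the data frames just built) reports how many distinct n-grams each collection contains.
c("Individual Words" = nrow(unigram),
"Pairs of Words" = nrow(bigram),
"Triples of Words" = nrow(trigram))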
This basic histogram shows the frequency of individual words as they appear in the text. Most words appear only a few times, as we can see from the skewed histogram.
hist(unigram$Freq,
breaks = 200,
lwd = 2,
main = "Distribution of Individual Words", xlab = "Indvidual Words",
col = "gray")
The following chunk of code shows a basic summary of the representative subsets: the minimum, quartiles, mean, and maximum of the number of times an individual word, a pair of words, or a triple of words appears in the text.
rbind("Individual Words" = summary(unigram$Freq),
"Pairs of Words" = summary(bigram$Freq),
"Triples of Words" = summary(trigram$Freq))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Individual Words 1 1 1 3.422 3 304
## Pairs of Words 1 1 1 1.049 1 26
## Triples of Words 1 1 1 1.002 1 5
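To quantify this skew, the sketch below runs a quick check on the ‘unigram’ data frame built above: the share of words that occur exactly once, and the number of distinct words needed to cover 50% and 90% of all word occurrences (the object names are illustrative).
singleton.share <- mean(unigram$Freq == 1) # share of words seen exactly once
coverage <- cumsum(unigram$Freq) / sum(unigram$Freq) # 'unigram' is already sorted by frequency
c("Singleton share" = singleton.share,
"Words for 50% coverage" = which(coverage >= 0.5)[1],
"Words for 90% coverage" = which(coverage >= 0.9)[1])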
The following graph shows the ten most frequent individual words, pairs, and triples of words in the text along with their frequencies.
plotfreq <- function(x,y,z){
barplot(x[1:10,2],
names.arg=x[1:10,1],
horiz = TRUE,
las = 2,
main = y,
col = z)
}
par(mfrow = c(1,3))
plotfreq(unigram, "Individual", "blue")
plotfreq(bigram, "Pairs", "red")
plotfreq(trigram, "Trips", "yellow")
My further plans involve building a probabilistic model on the n-gram sets of the data, i.e. calculating the probability of a word appearing after a single word or a pair of words. Based on these probabilities, the model can suggest the three most likely next words.
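As a first illustration of this plan, the sketch below uses the ‘bigram’ data frame built above to estimate, by maximum likelihood, the probability of each word following a given word and to return the most likely candidates. The function name and the example call are illustrative, and unseen words are not handled here; the final model will need smoothing or back-off for those. Note also that, since the corpus was stemmed and stop words were removed, both the input word and the suggestions are in that processed form.
suggestnext <- function(word, n = 3){
# split the stored pairs into first and second words
pairs <- strsplit(as.character(bigram[[1]]), " ")
first <- sapply(pairs, `[`, 1)
second <- sapply(pairs, `[`, 2)
idx <- which(first == word)
if(length(idx) == 0) return(character(0)) # unseen word: no suggestion in this sketch
# maximum-likelihood estimate of P(next word | word)
probs <- bigram$Freq[idx] / sum(bigram$Freq[idx])
head(second[idx][order(probs, decreasing = TRUE)], n)
}
suggestnext("time") # illustrative call: the three most likely words to follow "time"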