This report describes my exploratory analysis of the data and my goals for the app and the prediction algorithm. The data was provided as part of the capstone project in the Data Science Specialization by Johns Hopkins University on Coursera. Analyzing the data is necessary to understand its structure and to plan an effective predictive text model; to do this, the goal is to identify patterns in the data.
Data for this project was obtained from the Coursera data set at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip; I used the US English texts.
if (!file.exists("Coursera-SwiftKey.zip")){
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
"Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip")
}
blogs <- readLines("./final/en_US/en_US.blogs.txt", encoding="UTF-8", skipNul=TRUE)
news <- readLines(file("./final/en_US/en_US.news.txt", blocking=TRUE, open="rb"), encoding="UTF-8", skipNul=TRUE)
twitter <- readLines("./final/en_US/en_US.twitter.txt", encoding="UTF-8", skipNul=TRUE)
For each file used, I obtain basic information: its size, number of lines, number of words, and the length of the longest line.
info <- cbind.data.frame(c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
c(round(file.info("./final/en_US/en_US.blogs.txt")$size/1024/1024,1),
round(file.info("./final/en_US/en_US.news.txt")$size/1024/1024,1),
round(file.info("./final/en_US/en_US.twitter.txt")$size/1024/1024,1)),
c(length(blogs), length(news), length(twitter)),
c(sum(sapply(strsplit(blogs,"\\s+"),length)),
sum(sapply(strsplit(news,"\\s+"),length)),
sum(sapply(strsplit(twitter,"\\s+"),length))),
c(max(nchar(blogs)), max(nchar(news)), max(nchar(twitter))))
names(info) <- c("File", "Size (MB)", "# Lines", "# Words", "Longest Line Length")
info
## File Size (MB) # Lines # Words Longest Line Length
## 1 en_US.blogs.txt 200.4 100 4704 1461
## 2 en_US.news.txt 196.3 100 3222 982
## 3 en_US.twitter.txt 159.4 100 1275 140
Due to the excessive amount of memory required to store and process all of the data, and the limitations of the computer I was working on, I took a sample of only 1,000 lines from each data set.
set.seed(777)
sample.size.factor <- 1000
data <- c(sample(blogs, sample.size.factor),
sample(news, sample.size.factor),
sample(twitter, sample.size.factor))
length(data)
## [1] 3000
object.size(data)
## 689752 bytes
It was necessary to perform the following cleaning tasks using the tm package: eliminate punctuation symbols and numbers, convert all letters to lowercase, and strip whitespace. Because the original texts contain non-UTF-8 characters, the iconv function was used to change encodings. Finally, a list of profanity words was obtained from www.freewebheaders.com.
These cleaning tasks prepare the data for the exploratory analysis. I decided not to remove stop words, because I want my prediction algorithm to be able to predict them.
require(tm)
## Loading required package: tm
## Loading required package: NLP
# Clean non UTF-8
data <- iconv(data, from="UTF-8", to="ASCII", sub="")
# Clean transformations
corpus <- VCorpus(VectorSource(data))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
# Clean banned words
if (!file.exists("bannedwords.zip"))
{
download.file("http://www.freewebheaders.com/wordpress/wp-content/uploads/full-list-of-bad-words-banned-by-google-txt-file.zip", "bannedwords.zip")
unzip("bannedwords.zip")
}
banned <- readLines(file("full-list-of-bad-words-banned-by-google-txt-file_2013_11_26_04_53_31_867.txt", blocking=TRUE, open="rb"))
corpus <- tm_map(corpus, removeWords, banned)
corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3000
An n-gram is a contiguous sequence of n items from a given sequence of text or speech. I used the RWeka package to perform frequency analyses of 1-grams, 2-grams, and 3-grams; a small illustration of the tokenizer follows the package loading below.
require(RWeka)
## Loading required package: RWeka
require(ggplot2)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
require(data.table)
## Loading required package: data.table
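To make the n-gram definition above concrete, here is a small illustration (a sketch only; the sentence is invented for demonstration) of what the RWeka tokenizer produces for a single phrase: its 2-grams are simply the consecutive word pairs.
# Illustration only: 2-grams of a short made-up phrase
NGramTokenizer("thanks for the follow", Weka_control(min = 2, max = 2))
# Should yield the pairs "thanks for", "for the", "the follow"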
# Make the 1-gram
Tokenizer1Gram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
uni.gram <- DocumentTermMatrix(corpus, control = list(tokenize = Tokenizer1Gram))
uni.gram
## <<DocumentTermMatrix (documents: 3000, terms: 13781)>>
## Non-/sparse entries: 59073/41283927
## Sparsity : 100%
## Maximal term length: 31
## Weighting : term frequency (tf)
# Plot the top 20 1-gram
freq1 <- sort(colSums(as.matrix(uni.gram)), decreasing=TRUE)
head(freq1, 20)
## the and that for you was with have this are but not from his its
## 4420 2198 929 899 675 605 602 520 488 425 421 392 331 314 310
## they said one all will
## 299 285 267 260 257
datafreq1 <- as.data.frame(data.table(word=names(freq1), freq=freq1))
ggplot(head(datafreq1, 20), aes(x=reorder(word, freq), y=freq)) +
geom_bar(stat="identity", fill="#00AA00") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
xlab("1-gram") +
ylab("Frequency") +
ggtitle("Top 20 1-grams") +
coord_flip()
# Make the 2-gram
Tokenizer2Gram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bi.gram <- DocumentTermMatrix(corpus, control = list(tokenize = Tokenizer2Gram))
bi.gram
## <<DocumentTermMatrix (documents: 3000, terms: 57672)>>
## Non-/sparse entries: 82098/172933902
## Sparsity : 100%
## Maximal term length: 38
## Weighting : term frequency (tf)
# Plot the top 20 2-gram
freq2 <- sort(colSums(as.matrix(bi.gram)), decreasing=TRUE)
head(freq2, 20)
## of the in the on the to the for the at the to be and the
## 433 419 214 192 145 131 124 121
## in a it was from the is a will be of a i have it is
## 108 100 91 88 85 78 76 75
## with a and i i was with the
## 75 74 74 74
datafreq2 <- as.data.frame(data.table(word=names(freq2), freq=freq2))
ggplot(head(datafreq2, 20), aes(x=reorder(word, freq), y=freq)) +
geom_bar(stat="identity", fill="#0077AA") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
xlab("2-gram") +
ylab("Frequency") +
ggtitle("Top 20 2-grams") +
coord_flip()
# Make the 3-gram
Tokenizer3Gram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tri.gram <- DocumentTermMatrix(corpus, control = list(tokenize = Tokenizer3Gram))
tri.gram
## <<DocumentTermMatrix (documents: 3000, terms: 76963)>>
## Non-/sparse entries: 80646/230808354
## Sparsity : 100%
## Maximal term length: 45
## Weighting : term frequency (tf)
# Plot the top 20 3-gram
freq3 <- sort(colSums(as.matrix(tri.gram)), decreasing=TRUE)
head(freq3, 20)
## one of the a lot of be able to the end of there is a
## 34 25 18 17 16
## to be a some of the out of the i have to as well as
## 16 15 14 13 12
## going to be im going to is going to look look look the fact that
## 12 12 12 12 12
## at the end i dont know in the first it was a one of those
## 11 11 11 11 11
datafreq3 <- as.data.frame(data.table(word=names(freq3), freq=freq3))
ggplot(head(datafreq3, 20), aes(x=reorder(word, freq), y=freq)) +
geom_bar(stat="identity", fill="#AA77FF") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
xlab("3-gram") +
ylab("Frequency") +
ggtitle("Top 20 3-grams") +
coord_flip()
With this analysis and these findings I will work on the development of the prediction algorithm and the Shiny app. The general idea is that for any given input of n tokens, I will try to find a suitable (n+1)-gram to predict the next word. However, as n increases there are fewer suitable occurrences, so the algorithm will likely need to fall back to shorter n-grams when no longer match is found. The application will be implemented as a Shiny app that lets the user enter a phrase and suggests the most likely next word.
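As a first sketch of this idea (illustration only, not the final algorithm: the function name predict.next.word and the simple frequency-based back-off are assumptions of mine), the prediction could search the 3-gram table built above for entries that start with the user's last two words, and fall back to the 2-gram table when nothing matches.
# Sketch: predict the next word from the datafreq2 and datafreq3 tables above.
# Raw frequencies stand in for probabilities; no smoothing is applied yet.
predict.next.word <- function(phrase, datafreq2, datafreq3) {
  tokens <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(tokens)
  # Try 3-grams first: entries beginning with the last two input words
  if (n >= 2) {
    hits <- datafreq3[startsWith(datafreq3$word, paste(tokens[n-1], tokens[n], "")), ]
    if (nrow(hits) > 0)
      return(tail(strsplit(hits$word[which.max(hits$freq)], " ")[[1]], 1))
  }
  # Back off to 2-grams: entries beginning with the last input word
  hits <- datafreq2[startsWith(datafreq2$word, paste(tokens[n], "")), ]
  if (nrow(hits) > 0)
    return(tail(strsplit(hits$word[which.max(hits$freq)], " ")[[1]], 1))
  NA_character_
}
# Example: given the tables above, "one of" should suggest "the",
# since "one of the" is the most frequent 3-gram in the sample.
predict.next.word("one of", datafreq2, datafreq3)
The final app would precompute and probably smooth these lookup tables, but the structure stays the same: match the longest available context first, then back off to shorter ones.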