In this report I explain my exploratory analysis and my goals for the final app and algorithm. This document briefly summarizes the plans for creating the prediction algorithm and Shiny app.
The data used in this document were downloaded on January 02, 2018 from the following site:
if (!file.exists("./data")) {
    dir.create("./data")
}
link <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("./data/Coursera-Swiftkey.zip")) {
    download.file(link, "./data/Coursera-Swiftkey.zip")
}
Libraries used:
library(readr)
library(ngram)
library(NLP)
library(tm)
library(ggplot2)
library(gridExtra)
library(wordcloud)
library(RWeka)
The data were processed as follows.
First, I decompress the downloaded zip file:
unzip("./data/Coursera-Swiftkey.zip")
and then load the files into separate objects for analysis:
blogs_full <- read_lines("./final/en_US/en_US.blogs.txt", progress = FALSE)
news_full <- read_lines("./final/en_US/en_US.news.txt", progress = FALSE)
twitter_full <- read_lines("./final/en_US/en_US.twitter.txt", progress = FALSE)
Some basic stats:
df_stat <- data.frame(FileName = "en_US.blogs.txt", NumRow = NROW(blogs_full), NumChars = sum(nchar(blogs_full)))
df_stat <- rbind(df_stat, data.frame(FileName = "en_US.news.txt", NumRow = NROW(news_full), NumChars = sum(nchar(news_full))))
df_stat <- rbind(df_stat, data.frame(FileName = "en_US.twitter.txt", NumRow = NROW(twitter_full), NumChars = sum(nchar(twitter_full))))
knitr::kable(df_stat, row.names = FALSE, format.args = list(big.mark = ','))
| FileName | NumRow | NumChars |
|---|---|---|
| en_US.blogs.txt | 899,288 | 206,824,505 |
| en_US.news.txt | 1,010,242 | 203,223,159 |
| en_US.twitter.txt | 2,360,148 | 162,096,031 |
The files are large enough that they can be sampled to reduce processing time. I sample 1% of each file:
set.seed(27182)
blogs <- sample(blogs_full, 0.01 * NROW(blogs_full))
news <- sample(news_full, 0.01 * NROW(news_full))
twitter <- sample(twitter_full, 0.01 * NROW(twitter_full))
df_stat_sample <- data.frame(Data = "blogs", NumRow = NROW(blogs), NumWords = wordcount(blogs), NumChars = sum(nchar(blogs)))
df_stat_sample <- rbind(df_stat_sample, data.frame(Data = "news", NumRow = NROW(news), NumWords = wordcount(news), NumChars = sum(nchar(news))))
df_stat_sample <- rbind(df_stat_sample, data.frame(Data = "twitter", NumRow = NROW(twitter), NumWords = wordcount(twitter), NumChars = sum(nchar(twitter))))
knitr::kable(df_stat_sample, row.names = FALSE, format.args = list(big.mark = ','))
| Data | NumRow | NumWords | NumChars |
|---|---|---|---|
| blogs | 8,992 | 372,961 | 2,063,969 |
| news | 10,102 | 343,461 | 2,032,212 |
| twitter | 23,601 | 303,961 | 1,621,456 |

With these samples the report is completely processed within one hour.
Now I clean the datasets and prepare the corpora:
# to avoid repetition I create a function
cleanData <- function(x) {
    x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "") # drop non-ASCII characters that would otherwise cause errors
    x <- VCorpus(VectorSource(x)) # convert to a tm corpus
    x <- tm_map(x, tolower) # all lowercase
    x <- tm_map(x, removeNumbers) # no numbers
    x <- tm_map(x, removePunctuation) # no punctuation
    x <- tm_map(x, stripWhitespace) # collapse extra white space
    x <- tm_map(x, PlainTextDocument) # back to plain text documents for NLP processing
    x <- tm_map(x, removeWords, stopwords(kind = "en")) # remove English stop words
    return(x)
}
blogs_clean <- cleanData(blogs)
news_clean <- cleanData(news)
twitter_clean <- cleanData(twitter)
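As a quick optional sanity check (not part of the original pipeline; it only assumes the cleanData() output above), one cleaned document can be compared with its raw counterpart:

blogs[1] # original sampled line
as.character(blogs_clean[[1]]) # the same line after cleaning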
Now I analyze the cleaned corpora. First I create a Document-Term Matrix for each of them:
# define various steps in one function to repeat for every corpus
createDocumentTermMatrix <- function(x) {
    # create Document-Term Matrix with term-frequency weighting
    documentTermMatrix <- DocumentTermMatrix(x, control = list(weighting = weightTf))
    return(documentTermMatrix)
}
blogsDTM <- createDocumentTermMatrix(blogs_clean)
newsDTM <- createDocumentTermMatrix(news_clean)
twitterDTM <- createDocumentTermMatrix(twitter_clean)
Now some statistics on the Document-Term Matrices:
blogsDTM
## <<DocumentTermMatrix (documents: 8992, terms: 28758)>>
## Non-/sparse entries: 171156/258420780
## Sparsity : 100%
## Maximal term length: 84
## Weighting : term frequency (tf)
newsDTM
## <<DocumentTermMatrix (documents: 10102, terms: 30403)>>
## Non-/sparse entries: 181704/306949402
## Sparsity : 100%
## Maximal term length: 35
## Weighting : term frequency (tf)
twitterDTM
## <<DocumentTermMatrix (documents: 23601, terms: 25595)>>
## Non-/sparse entries: 159930/603907665
## Sparsity : 100%
## Maximal term length: 40
## Weighting : term frequency (tf)
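The reported sparsity is simply the share of empty cells in each matrix, and it can be reproduced from the figures above; a small illustrative check (numbers taken from the blogsDTM summary):

# sparsity = 1 - non-sparse entries / (documents * terms)
1 - 171156 / (8992 * 28758) # ~0.9993, which tm rounds and prints as 100%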
And then some statistics about word frequencies:
freqBlogs <- sort(colSums(as.matrix(blogsDTM)), decreasing=TRUE)
head(freqBlogs, 10)
## one will just can like time get now people also
## 1221 1107 1031 956 949 888 721 632 600 565
freqNews <- sort(colSums(as.matrix(newsDTM)), decreasing=TRUE)
head(freqNews, 10)
## said will one new year two can also first just
## 2416 1056 870 709 610 591 579 573 547 528
freqTwitter <- sort(colSums(as.matrix(twitterDTM)), decreasing=TRUE)
head(freqTwitter, 10)
## just like get love good will dont can day thanks
## 1577 1233 1117 1035 979 913 899 888 874 855
We can see that the ten most frequent words of the three corpora share several common terms.
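To make the overlap explicit, the three top-10 lists can be intersected; a small illustrative check using the frequency vectors computed above:

# words appearing in the top 10 of all three corpora
Reduce(intersect, list(names(head(freqBlogs, 10)),
                       names(head(freqNews, 10)),
                       names(head(freqTwitter, 10))))
# from the tables above this yields "will", "just" and "can"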
Histograms of the word frequencies illustrate this more clearly:
plotTopFreq <- function(x, corp) {
    inputData <- head(sort(x, decreasing = TRUE), 25) # top 25 words (at most; avoids NA padding when fewer terms exist)
    df <- data.frame(x = names(inputData), y = inputData) # data frame for ggplot
    p <- ggplot(data = df, aes(x = reorder(x, -y), y = y)) +
        xlab(paste("Top 25 words -", corp)) + ylab("Count") +
        geom_bar(stat = "identity") +
        theme_light() + theme(axis.text.x = element_text(angle = 60))
    p
}
pb <- plotTopFreq(freqBlogs, "Blogs")
pn <- plotTopFreq(freqNews, "News")
ptw <- plotTopFreq(freqTwitter, "Twitter")
grid.arrange(pb, pn, ptw, ncol=1)
and some word clouds:
wordcloud(names(freqBlogs), freqBlogs, max.words = 50)
wordcloud(names(freqNews), freqNews, max.words = 50)
wordcloud(names(freqTwitter), freqTwitter, max.words = 50)
Now, similarly to the single-word (unigram) statistics, I calculate bigram frequencies and print the ten most frequent for each corpus:
# function to create N-Gram frequencies
createNFreq <- function(x, n = 2, lowfreq = 10) {
    # tokenizer to extract n-grams (from the tm FAQ)
    myNGramTokenizer <- function(y) {
        NGramTokenizer(y, Weka_control(min = n, max = n))
    }
    ngramXTDM <- TermDocumentMatrix(x, control = list(tokenize = myNGramTokenizer))
    # keep only n-grams occurring at least 'lowfreq' times and sum their counts
    freqNgramX <- rowSums(as.matrix(ngramXTDM[findFreqTerms(ngramXTDM, lowfreq = lowfreq), ]))
    return(freqNgramX)
}
bigramTopFreqBlogs <- sort(createNFreq(blogs_clean), decreasing = T)
bigramTopFreqNews <- sort(createNFreq(news_clean), decreasing = T)
bigramTopFreqTwitter <- sort(createNFreq(twitter_clean), decreasing = T)
# Print Top 10
head(bigramTopFreqBlogs, 10)
## dont know even though im sure years ago first time feel like
## 69 50 50 50 48 47
## right now can see new york dont think
## 46 44 42 38
head(bigramTopFreqNews, 10)
## last year new york st louis high school years ago los angeles
## 145 109 92 91 75 66
## new jersey last week first time dont know
## 66 60 50 48
head(bigramTopFreqTwitter, 10)
## cant wait right now last night looking forward
## 175 165 122 102
## dont know happy birthday just got im going
## 88 84 84 81
## feel like good morning
## 80 76
Then I can plot histograms:
pb <- plotTopFreq(bigramTopFreqBlogs, "Blogs")
pn <- plotTopFreq(bigramTopFreqNews, "News")
ptw <- plotTopFreq(bigramTopFreqTwitter, "Twitter")
print(pb)
print(pn)
print(ptw)
For trigrams I proceed as in the previous case:
# same function as for bigrams, with n = 3 and a lower lowfreq threshold
trigramTopFreqBlogs <- sort(createNFreq(blogs_clean, n = 3, lowfreq = 5), decreasing = T)
trigramTopFreqNews <- sort(createNFreq(news_clean, n = 3, lowfreq = 5), decreasing = T)
trigramTopFreqTwitter <- sort(createNFreq(twitter_clean, n = 3, lowfreq = 5), decreasing = T)
# Print Top 10
head(trigramTopFreqBlogs, 10)
## accounting jobs italy new york city ventimiglia italy hotels
## 20 14 9
## amazon services llc italy accounting jobs dont get wrong
## 8 7 6
## im pretty sure jobs italy accounting new york ny
## 6 6 6
## two years ago
## 6
head(trigramTopFreqNews, 10)
## st louis county president barack obama four years ago
## 15 14 11
## gov chris christie world war ii first time since
## 11 10 9
## new york city cents per share high school basketball
## 9 8 7
## two weeks ago
## 7
head(trigramTopFreqTwitter, 10)
## cant wait see happy mothers day let us know
## 37 34 18
## la la la happy valentines day cinco de mayo
## 13 12 11
## im pretty sure just got back cant wait get
## 11 10 9
## cant wait hear
## 9
Then I can plot histograms:
pb <- plotTopFreq(trigramTopFreqBlogs, "Blogs")
pn <- plotTopFreq(trigramTopFreqNews, "News")
ptw <- plotTopFreq(trigramTopFreqTwitter, "Twitter")
print(pb)
print(pn)
print(ptw)
As the plots and the tables of most frequent terms show, moving from unigrams to bigrams and trigrams gives an increasingly narrow set of frequent values, which should make prediction more effective.
The processing currently takes a lot of time, but we have seen that the data are very sparse, so taking advantage of this sparsity should make the algorithm more efficient.
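One possible way to do this, sketched below, is to keep the matrices in their sparse slam form (the format tm already uses internally) and drop extremely rare terms with removeSparseTerms(); the 0.999 threshold is only an illustrative value:

# drop terms missing from more than 99.9% of documents and
# sum counts on the sparse matrix instead of a dense as.matrix()
blogsDTM_small <- removeSparseTerms(blogsDTM, sparse = 0.999)
freqBlogs_fast <- sort(slam::col_sums(blogsDTM_small), decreasing = TRUE)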
For now, Katz's back-off model seems the most promising option for prediction, but only later will I be able to say whether it is efficient enough to be applied.
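As a first, much simplified sketch of the back-off idea (not a full Katz model: there is no discounting, it only looks up the blogs frequency vectors computed above, and predictNext() is a hypothetical helper):

# look for the most frequent trigram starting with the last two words;
# if none is found, back off to bigrams, then to the overall top unigram
predictNext <- function(w1, w2) {
    tri <- trigramTopFreqBlogs[grepl(paste0("^", w1, " ", w2, " "), names(trigramTopFreqBlogs))]
    if (length(tri) > 0) return(strsplit(names(tri)[1], " ")[[1]][3])
    bi <- bigramTopFreqBlogs[grepl(paste0("^", w2, " "), names(bigramTopFreqBlogs))]
    if (length(bi) > 0) return(strsplit(names(bi)[1], " ")[[1]][2])
    names(freqBlogs)[1]
}
predictNext("new", "york") # expected to return "city" given the trigram table above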