This is the report of exploratory analysis of SwiftKey data. The objective is to make observations and determine the goals for the eventual app and algorithm. I did the analysis on english files.
General file is already unzipped.
blog_us<-readLines("./final/en_US/en_US.blogs.txt", skipNul = TRUE, warn= FALSE)
news_us<-readLines("./final/en_US/en_US.news.txt", skipNul = TRUE, warn= FALSE)
twitter_us<-readLines("./final/en_US/en_US.twitter.txt", skipNul = TRUE, warn=FALSE)
library(ngram)
#1st we store counts of each file
blog_count <- length(blog_us)
news_count <- length(news_us)
twitter_count <- length(twitter_us)
#then we create a data frame to see the counts
data_set_length <-c(blog_count,news_count, twitter_count)
data_set_length <-data.frame(data_set_length)
names(data_set_length)[1] <-"Line Count"
row.names(data_set_length) <- c("Blog", "News", "Twitter")
data_set_length
## Line Count
## Blog 899288
## News 77259
## Twitter 2360148
#and now the same with words
blog_words <- wordcount(blog_us)
news_words <- wordcount(news_us)
twiter_words <- wordcount(twitter_us)
The objective is to see the words with highest frequency. Because of the size of the files, we will work on samples from files (1/1000 of each file).
f <- 1/1000
set.seed(12345)
sample_blog <- sample(blog_us, length(blog_us) * f)
sample_news <- sample(news_us, length(news_us) * f)
sample_twitter <- sample(twitter_us, length(twitter_us) * f)
combine <- paste(sample_blog, sample_news, sample_twitter, collapse = " ")
words <- unlist(strsplit(tolower(combine), "\\s+"))
table_words <- table(words)
hfw <- head(sort(table_words, decreasing = TRUE), 20)
barplot(hfw, main = "high frequency words", xlab = "words", ylab = "Frequency",las=2)
Finally, let’s look the top 10 words, and the words associated with these words.(sorry some variables are in french, it was easier for me)
mots_a_analyser <- names(hfw[1:10])
paires_mots_associés <- list()
for (mot in mots_a_analyser) {
indices <- grep(mot, words)
paires_mots <- lapply(indices, function(idx) {
mots_avant <- words[max(1, idx - 1)]
mots_apres <- words[min(length(words), idx + 1)]
return(paste(mots_avant, mot, mots_apres, sep = " "))
})
paires_mots <- unlist(paires_mots)
table_paires <- table(paires_mots)
paires_mots_frequents <- head(sort(table_paires, decreasing = TRUE), 10)
paires_mots_associés[[mot]] <- paires_mots_frequents
}
for (mot in mots_a_analyser) {
cat("words associated with", mot, ":\n")
print(paires_mots_associés[[mot]])
cat("\n")
}
## words associated with the :
## paires_mots
## of the year that the is and the are and the have in the and
## 63 38 37 36 35
## all the rest by the end in the league. when the are between the two
## 34 34 33 33 32
##
## words associated with and :
## paires_mots
## hunting and fishing the and of and and and
## 60 38 34
## to and that 2004 and 2009, activities, and final
## 32 31 31
## age and younger assists and with attitude and our
## 31 31 31
## audience. and "bully"
## 31
##
## words associated with to :
## paires_mots
## going to be need to get able to get appear to be led to a
## 50 36 33 33 33
## time to get close to the starting to get the to he want to try
## 33 32 32 32 32
##
## words associated with a :
## paires_mots
## the a of the a and to a the and a the to a a a a of and a and you a a
## 371 169 134 127 121 107 94 94
## in a and the a are
## 85 81
##
## words associated with of :
## paires_mots
## most of the one of the many of these
## 48 40 34
## as of and the of this parts of the
## 33 33 32
## "all of the amount of bad approval of a
## 31 31 31
## because of financial
## 31
##
## words associated with in :
## paires_mots
## the in of is in to be in the the in with it's in to said in a
## 97 86 73 73 65 62
## the in and of in that is in and the in in
## 53 45 41 38
##
## words associated with i :
## paires_mots
## the i of the i and is i to and i the a i of be i the the i in
## 414 163 100 94 92 82 79
## our i the the i with and i and
## 78 78 76
##
## words associated with that :
## paires_mots
## is that the fact that there
## 38 33
## 2007 that it adding that while
## 31 31
## and that a anticipate that improvement
## 31 31
## but that enrollment certify that the
## 31 31
## discovered that good hope that it
## 31 31
##
## words associated with is :
## paires_mots
## there is no that is the this is a it is an
## 41 38 38 36
## the is and it is my a is born, an is who
## 33 32 31 31
## apartment is about as is because
## 31 31
##
## words associated with for :
## paires_mots
## and for to out for a
## 37 34
## "oklahoma" for intermission 7 for tours,
## 31 31
## america for whom and for many
## 31 31
## camps for all camps for football,
## 31 31
## disrespect for america's doubtful for sunday's
## 31 31