Introduction

This report presents an exploratory analysis of the SwiftKey data. The objective is to make observations and set the goals for the eventual app and algorithm. The analysis covers the English files only.

Reading the files

The main archive is already unzipped.

# Read each English file; skipNul drops embedded NUL characters
blog_us <- readLines("./final/en_US/en_US.blogs.txt", skipNul = TRUE, warn = FALSE)
news_us <- readLines("./final/en_US/en_US.news.txt", skipNul = TRUE, warn = FALSE)
twitter_us <- readLines("./final/en_US/en_US.twitter.txt", skipNul = TRUE, warn = FALSE)
# ngram provides wordcount(), used for the word counts
library(ngram)
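
A note on reading: on some platforms (Windows in particular), reading a file in text mode can stop early at a control character, which shows up as a line count that looks low for the file size. The sketch below shows one way to guard against that by opening the connection in binary mode. The helper name `read_corpus` and the temporary demo file are hypothetical and are not part of the analysis above.

```r
# Hypothetical helper: read a text file through a binary-mode connection
read_corpus <- function(path) {
  con <- file(path, open = "rb")   # binary mode: no early stop on control characters
  on.exit(close(con))              # always release the connection
  readLines(con, skipNul = TRUE, warn = FALSE, encoding = "UTF-8")
}

# Demo on a small temporary file that contains a SUB (0x1A) character
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("first line\nsecond\x1a line\nthird line\n"), tmp)
demo_lines <- read_corpus(tmp)
length(demo_lines)
```

Comparing `length(read_corpus(path))` with the text-mode count is a quick way to check whether a file was read completely.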

Lines and word counts

# First, store the line count of each file
blog_count <- length(blog_us)
news_count <- length(news_us)
twitter_count <- length(twitter_us)
# Then build a data frame to display the counts
data_set_length <- c(blog_count, news_count, twitter_count)
data_set_length <- data.frame(data_set_length)
names(data_set_length)[1] <- "Line Count"
row.names(data_set_length) <- c("Blog", "News", "Twitter")
data_set_length
##         Line Count
## Blog        899288
## News         77259
## Twitter    2360148
# Now the same for words
blog_words <- wordcount(blog_us)
news_words <- wordcount(news_us)
twitter_words <- wordcount(twitter_us)
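
The word counts can be displayed next to the line counts in a single table. The following is a minimal sketch on toy data. It uses a base-R stand-in, `count_words` (a hypothetical helper that splits on whitespace), in place of `wordcount()` so that it runs without extra packages, and the `toy` corpus is invented for illustration.

```r
# Hypothetical stand-in for wordcount(): count whitespace-separated tokens
count_words <- function(lines) {
  tokens <- strsplit(trimws(lines), "\\s+")
  sum(vapply(tokens, function(tk) sum(nzchar(tk)), integer(1)))
}

# Toy corpus (invented), one character vector per source
toy <- list(
  Blog    = c("a short blog post", "another one"),
  News    = c("breaking news today"),
  Twitter = c("hi", "so tired", "")
)

# One row per source: line count, word count, average words per line
summary_table <- data.frame(
  "Line Count" = vapply(toy, length, integer(1)),
  "Word Count" = vapply(toy, count_words, integer(1)),
  row.names = names(toy),
  check.names = FALSE
)
summary_table[["Words per Line"]] <-
  round(summary_table[["Word Count"]] / summary_table[["Line Count"]], 2)
summary_table
```

The words-per-line column gives a quick feel for how the sources differ in style, for example short tweets versus longer blog entries.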

Words analysis

Barplot of the 20 most frequent words

The objective is to see which words occur most often. Because the files are large, we work on a random sample of 1/1000 of the lines of each file.

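f <- 1/1000
set.seed(12345)  # make the samples reproducible
sample_blog <- sample(blog_us, length(blog_us) * f)
sample_news <- sample(news_us, length(news_us) * f)
sample_twitter <- sample(twitter_us, length(twitter_us) * f)
# Concatenate the samples with c() before collapsing: passing them to paste()
# as separate arguments would recycle the shorter samples to the length of the
# longest one and inflate their word counts.
combine <- paste(c(sample_blog, sample_news, sample_twitter), collapse = " ")
# Lowercase and split on whitespace (punctuation stays attached to the words)
words <- unlist(strsplit(tolower(combine), "\\s+"))
table_words <- table(words)
hfw <- head(sort(table_words, decreasing = TRUE), 20)
barplot(hfw, main = "High-frequency words", xlab = "Words", ylab = "Frequency", las = 2)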
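
When samples of different lengths are combined into one text, it matters how they are handed to `paste()`: vectors passed as separate arguments are recycled to the longest length, whereas a single vector built with `c()` is collapsed as is. The sketch below illustrates the difference on toy data, together with a tokenizer that also strips punctuation. The names `sample_a`, `sample_b` and `tokenize` are hypothetical, and removing punctuation is an assumption about what the eventual model may want, not something this analysis does.

```r
# Toy samples of different lengths (invented for illustration)
sample_a <- c("The cat sat.", "A dog barked!")
sample_b <- c("The end.")

# Separate arguments: sample_b is recycled and appears twice
recycled <- paste(sample_a, sample_b, collapse = " ")
# One concatenated vector: every line appears exactly once
stacked <- paste(c(sample_a, sample_b), collapse = " ")

# Hypothetical tokenizer: lowercase, replace anything that is not a letter,
# apostrophe or space with a space, then split on whitespace
tokenize <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z' ]+", " ", text)
  tokens <- unlist(strsplit(text, "\\s+"))
  tokens[nzchar(tokens)]
}

freq_recycled <- table(tokenize(recycled))
freq_stacked <- table(tokenize(stacked))
freq_recycled[["end"]]  # counted twice because of recycling
freq_stacked[["end"]]   # counted once
```

With real samples the effect is larger: the smallest sample is repeated as many times as needed to match the largest one.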

Pairs of words

Finally, let's look at the top 10 words and at the words that appear immediately before and after them. (Some variable names are in French.)

mots_a_analyser <- names(hfw[1:10])
paires_mots_associés <- list()

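for (mot in mots_a_analyser) {
  # Exact match on the token: grep(mot, words) would also match substrings,
  # e.g. "a" inside "and", and count unrelated words
  indices <- which(words == mot)
  # For each occurrence, build "previous word, target word, next word"
  paires_mots <- lapply(indices, function(idx) {
    mots_avant <- words[max(1, idx - 1)]
    mots_apres <- words[min(length(words), idx + 1)]
    return(paste(mots_avant, mot, mots_apres, sep = " "))
  })
  paires_mots <- unlist(paires_mots)
  table_paires <- table(paires_mots)
  # Keep the 10 most frequent combinations for this word
  paires_mots_frequents <- head(sort(table_paires, decreasing = TRUE), 10)
  paires_mots_associés[[mot]] <- paires_mots_frequents
}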
for (mot in mots_a_analyser) {
  cat("words associated with", mot, ":\n")
  print(paires_mots_associés[[mot]])
  cat("\n")
}
## words associated with the :
## paires_mots
##     of the year     that the is     and the are    and the have      in the and 
##              63              38              37              36              35 
##    all the rest      by the end  in the league.    when the are between the two 
##              34              34              33              33              32 
## 
## words associated with and :
## paires_mots
##   hunting and fishing            the and of           and and and 
##                    60                    38                    34 
##           to and that        2004 and 2009, activities, and final 
##                    32                    31                    31 
##       age and younger      assists and with      attitude and our 
##                    31                    31                    31 
## audience. and "bully" 
##                    31 
## 
## words associated with to :
## paires_mots
##     going to be     need to get     able to get    appear to be        led to a 
##              50              36              33              33              33 
##     time to get    close to the starting to get       the to he     want to try 
##              33              32              32              32              32 
## 
## words associated with a :
## paires_mots
##  the a of the a and  to a the and a the    to a a    a a of and a and   you a a 
##       371       169       134       127       121       107        94        94 
##  in a and the a are 
##        85        81 
## 
## words associated with of :
## paires_mots
##          most of the           one of the        many of these 
##                   48                   40                   34 
##            as of and          the of this         parts of the 
##                   33                   33                   32 
##          "all of the        amount of bad        approval of a 
##                   31                   31                   31 
## because of financial 
##                   31 
## 
## words associated with in :
## paires_mots
##   the in of    is in to   be in the the in with  it's in to   said in a 
##          97          86          73          73          65          62 
##  the in and  of in that   is in and   the in in 
##          53          45          41          38 
## 
## words associated with i :
## paires_mots
##   the i of  the i and    is i to  and i the     a i of   be i the   the i in 
##        414        163        100         94         92         82         79 
##  our i the the i with  and i and 
##         78         78         76 
## 
## words associated with that :
## paires_mots
##                 is that the             fact that there 
##                          38                          33 
##                2007 that it           adding that while 
##                          31                          31 
##                  and that a anticipate that improvement 
##                          31                          31 
##         but that enrollment            certify that the 
##                          31                          31 
##        discovered that good                hope that it 
##                          31                          31 
## 
## words associated with is :
## paires_mots
##        there is no        that is the          this is a           it is an 
##                 41                 38                 38                 36 
##         the is and           it is my         a is born,          an is who 
##                 33                 32                 31                 31 
## apartment is about      as is because 
##                 31                 31 
## 
## words associated with for :
## paires_mots
##                  and for to                   out for a 
##                          37                          34 
## "oklahoma" for intermission                7 for tours, 
##                          31                          31 
##            america for whom                and for many 
##                          31                          31 
##               camps for all         camps for football, 
##                          31                          31 
##    disrespect for america's       doubtful for sunday's 
##                          31                          31
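
The neighbour lookup can also be written without an explicit loop over occurrences. The following is a minimal sketch on toy data, assuming exact token matching and skipping occurrences at the very start or end of the text. The function `context_counts` and the vector `words_demo` are hypothetical names introduced for illustration.

```r
# Toy token vector (invented); note that "there" contains "the"
words_demo <- c("one", "of", "the", "best", "there", "is",
                "one", "of", "the", "best", "in", "the", "world")

# Hypothetical helper: most frequent "previous target next" combinations
context_counts <- function(tokens, target, n_top = 10) {
  idx <- which(tokens == target)               # exact matches only
  idx <- idx[idx > 1 & idx < length(tokens)]   # need a neighbour on both sides
  trigrams <- paste(tokens[idx - 1], tokens[idx], tokens[idx + 1])
  head(sort(table(trigrams), decreasing = TRUE), n_top)
}

top_the <- context_counts(words_demo, "the")
top_the

# Substring matching would also pick up "there"
length(grep("the", words_demo))
length(which(words_demo == "the"))
```

Indexing with `idx - 1` and `idx + 1` builds all combinations in one vectorized call, which tends to scale better than calling a function once per occurrence.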