This report describes a preliminary analysis of the text data provided in the SwiftKey final/en_US data files, drawn from the corpora archived at https://web-beta.archive.org/web/20160930083655/http://www.corpora.heliohost.org/aboutcorpus.html
library(tm)        # text mining: corpus construction and transformations
library(XML)       # XML parsing utilities
library(SnowballC) # Porter stemmer used by stemDocument
library(qdap)      # freq_terms for term-frequency tables
con <- file("C:/Users/Barbara/Downloads/Coursera-SwiftKey/final/en_US/en_US.news.txt", "rb")
news <- readLines(con, skipNul = TRUE)
close(con)
con <- file("C:/Users/Barbara/Downloads/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", "rb")
twitter <- readLines(con, skipNul = TRUE)
close(con)
con <- file("C:/Users/Barbara/Downloads/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", "rb")
blogs <- readLines(con, skipNul = TRUE)
close(con)
The number of lines in each corpus is:
twitter: 2,360,148
news: 1,010,242
blogs: 899,288
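These counts come directly from the character vectors returned by readLines:
length(twitter)  # 2360148
length(news)     # 1010242
length(blogs)    # 899288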
Samples of 10,000 lines were randomly selected from each file.
# Reset the seed before each draw so every sample is reproducible on its own.
set.seed(123)
twit_samp <- sample(twitter, size = 10000)
set.seed(123)
blog_samp <- sample(blogs, size = 10000)
set.seed(123)
news_samp <- sample(news, size = 10000)
Preprocessing includes creating a VectorSource and converting it to a volatile corpus (VCorpus). The text was converted to lower case, and numbers, extra whitespace, “SMART” stopwords, and punctuation were removed. Each corpus was then stemmed. The code for the twitter corpus is shown below; the blog and news corpora were processed with similar code.
twitter_source <- VectorSource(twit_samp)
twitter_corpus <- VCorpus(twitter_source)                   # volatile (in-memory) corpus
twitter_corpus <- tm_map(twitter_corpus, removeNumbers)
twitter_corpus <- tm_map(twitter_corpus, stripWhitespace)
twitter_corpus <- tm_map(twitter_corpus, content_transformer(tolower))   # lower-case before stopword removal
twitter_corpus <- tm_map(twitter_corpus, removeWords, stopwords("SMART"))
twitter_corpus <- tm_map(twitter_corpus, removePunctuation)
twitter_stem <- tm_map(twitter_corpus, stemDocument)        # Porter stemming
The 20 most frequent stemmed terms in each sample were tabulated with qdap's freq_terms. blog_stem and news_stem were produced by the same pipeline shown above for twitter_stem; stopwords = "doc" filters out the "doc" token that the corpus-to-data-frame conversion introduces.
frequent_twit <- freq_terms(as.data.frame(twitter_stem), 20, stopwords = "doc")
frequent_blog <- freq_terms(as.data.frame(blog_stem), 20, stopwords = "doc")
frequent_news <- freq_terms(as.data.frame(news_stem), 20, stopwords = "doc")
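The resulting freq_terms objects can be printed as frequency tables or plotted directly (qdap supplies a plot method):
frequent_twit         # top 20 stemmed terms and their counts in the twitter sample
plot(frequent_twit)   # bar chart of the term frequencies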
The extracted corpora will be used to develop a predictive model for text entry. The model will be delivered as a Shiny app aimed at speeding up text entry on mobile devices.
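As a first step toward the prediction model, n-gram counts can be built from the cleaned samples. Below is a minimal sketch, assuming the RWeka package is available (it is not used elsewhere in this report); the tokenizer name is illustrative only.
library(RWeka)
# Tokenizer that emits two-word sequences (bigrams).
bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
twitter_bigrams <- TermDocumentMatrix(twitter_corpus,
                                      control = list(tokenize = bigram_tokenizer))
# Total count of each bigram across the 10,000 sampled tweets;
# slam (a tm dependency) sums the sparse matrix without densifying it.
bigram_freq <- sort(slam::row_sums(twitter_bigrams), decreasing = TRUE)
head(bigram_freq, 20)   # most common word pairs in the twitter sample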