This report describes a preliminary analysis of the text data provided in the SwiftKey final/en_US data files, drawn from the corpora archived at https://web-beta.archive.org/web/20160930083655/http://www.corpora.heliohost.org/aboutcorpus.html

Packages used for pre-processing and analysing the files:

library(tm)        # corpus creation and text transformations
library(XML)       # XML parsing
library(SnowballC) # Porter word stemming (used by stemDocument)
library(qdap)      # term-frequency analysis and plotting

Read in the downloaded files

con <- file("C:/Users/Barbara/Downloads/Coursera-SwiftKey/final/en_US/en_US.news.txt", "rb")
news <- readLines(con, skipNul = TRUE)
close(con)

con <- file("C:/Users/Barbara/Downloads/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", "rb")
twitter <- readLines(con, skipNul = TRUE)
close(con)

con <- file("C:/Users/Barbara/Downloads/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", "rb")
blogs <- readLines(con, skipNul = TRUE)
close(con)

The line counts for each corpus are:

twitter: 2360148

news: 1010242

blogs: 899288
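
These counts are simply the lengths of the character vectors returned by readLines():

length(twitter)  # 2360148
length(news)     # 1010242
length(blogs)    # 899288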

Samples of 10,000 lines were randomly selected from each file.

set.seed(123)  # fix the seed so the sampling is reproducible
twit_samp <- sample(twitter, size = 10000)
set.seed(123)  # reset before each draw so every sample is reproducible on its own
blog_samp <- sample(blogs, size = 10000)
set.seed(123)
news_samp <- sample(news, size = 10000)

Preprocessing each corpus to create stemmed documents

Preprocessing involves creating a VectorSource and converting it to a volatile corpus (VCorpus). Numbers are removed, extra whitespace is stripped, the text is converted to lower case, and “SMART” stopwords and punctuation are removed. The cleaned corpus is then stemmed. The code for the twitter corpus is shown below; the blog and news corpora are processed in the same way.

twitter_source <- VectorSource(twit_samp)
twitter_corpus <- VCorpus(twitter_source)
twitter_corpus <- tm_map(twitter_corpus, removeNumbers)                   # drop digits
twitter_corpus <- tm_map(twitter_corpus, stripWhitespace)                 # collapse repeated whitespace
twitter_corpus <- tm_map(twitter_corpus, content_transformer(tolower))    # lower-case before stopword matching
twitter_corpus <- tm_map(twitter_corpus, removeWords, stopwords("SMART")) # remove SMART stopwords
twitter_corpus <- tm_map(twitter_corpus, removePunctuation)
twitter_stem <- tm_map(twitter_corpus, stemDocument)                      # Porter stemming via SnowballC
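
The effect of the transformations can be spot-checked by printing a single document before and after stemming (a quick sanity check, not part of the original pipeline):

as.character(twitter_corpus[[1]])  # cleaned but unstemmed text
as.character(twitter_stem[[1]])    # the same document after stemming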

Calculate the top 20 words in each corpus with the qdap package and plot them as bar charts

frequent_twit <- freq_terms(as.data.frame(twitter_stem), 20, stopwords = "doc")

Top 20 words in the twitter corpus

frequent_blog <- freq_terms(as.data.frame(blog_stem), 20, stopwords = "doc")

Top 20 words in the blog corpus

frequent_news <- freq_terms(as.data.frame(news_stem), 20, stopwords = "doc")

Top 20 words in the news corpus
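
The freq_terms() calls above coerce each stemmed corpus to a data frame with as.data.frame(). If that coercion fails in your version of tm, an explicit conversion is a safe fallback, and qdap's plot() method for freq_terms objects draws the bar charts. A minimal sketch (the twit_df object and its text column are illustrative names, not part of the original code):

# explicit corpus-to-data-frame conversion, if as.data.frame() fails
twit_df <- data.frame(text = sapply(twitter_stem, as.character),
                      stringsAsFactors = FALSE)
frequent_twit <- freq_terms(twit_df$text, 20)

# bar charts of the top 20 terms in each corpus
plot(frequent_twit)
plot(frequent_blog)
plot(frequent_news)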

Further development of this project

The extracted corpora will be used to develop a predictive model for text entry. The model will be deployed as a Shiny app intended to speed up text entry on mobile devices.