Overview

This is the milestone report for Week 2 of the JHU Coursera Data Science Certificate Capstone Project. It uses exploratory data analysis to describe the major features of the data and outlines my plan for the capstone's prediction algorithm and Shiny application.
The data used in this analysis was provided by SwiftKey and comes from three separate sources:

  • Blogs
  • News articles
  • Twitter posts

While versions of these sources were provided in English, Russian, German, and Finnish, we will focus solely on the English sources.

Load in the Libraries

Here we will load the libraries that we will use to read in, process, and explore the data.

suppressMessages(suppressWarnings(library(knitr)))
suppressMessages(suppressWarnings(library(stringi)))
suppressMessages(suppressWarnings(library(ggplot2)))
suppressMessages(suppressWarnings(library(tm)))
suppressMessages(suppressWarnings(library(RWeka)))

Load in the Data

Now we will load the data that we will be exploring into R. We will skip showing the download of the data, but the data used is available here.

file_blogs <- "en_US/en_US.blogs.txt"
file_news <- "en_US/en_US.news.txt"
file_twitter <- "en_US/en_US.twitter.txt"
cons <- c(file_blogs, file_news, file_twitter)

# Read each file and assign its contents to an object named after its
# source (blogs, news, twitter), extracted from the file name
for (i in seq_along(cons)){
  con <- file(cons[i], open = "r")
  assign(substr(cons[i], 13, nchar(cons[i]) - 4),
         readLines(con = con, encoding = "UTF-8", skipNul = TRUE))
  close(con = con)
}
rm(con, i)

Summary of the Data

Summary of the Data Files

We will now provide some basic information on the data files that we will be working with. This will include the size of the file in megabytes, the total number of lines, the total words, and some summary statistics on the words per line in each file.

file_sizes <- round(file.info(cons)$size / 1024^2)
total_lines <- sapply(list(blogs, news, twitter), length)
total_words <- sapply(list(blogs, news, twitter), stri_stats_latex)[4, ]
wpl <- lapply(list(blogs, news, twitter), stri_count_words)
wpl_summ <- sapply(wpl, function(x) summary(x)[c('Min.', 'Mean', 'Max.')])
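
The table shown below can then be assembled and rendered with knitr::kable. The following is a minimal sketch of that step (the summary_df data frame and its column labels are my own, not necessarily the exact code used):

summary_df <- data.frame(File = cons,
                         `File Size` = paste(file_sizes, "MB"),
                         `Total Lines` = total_lines,
                         `Total Words` = total_words,
                         t(wpl_summ),
                         check.names = FALSE)
kable(summary_df)
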
File                      File Size   Total Lines   Total Words   Min WPL   Mean WPL   Max WPL
en_US/en_US.blogs.txt     200 MB          899,288    37,570,839         0      41.75     6,726
en_US/en_US.news.txt      196 MB        1,010,242    34,494,539         1      34.41     1,796
en_US/en_US.twitter.txt   159 MB        2,360,148    30,451,170         1      12.75        47

(WPL = words per line)

Words per Line

 par(mfrow=c(3,1))
 hist(wpl[[1]], 
      breaks = 50, 
      main = "Words per Line: Blogs", 
      xlab = "Words per Line", 
      col = "blue")
 hist(wpl[[2]], breaks = 50, 
      main = "Words per Line: News", 
      xlab = "Words per Line", 
      col = "red")
 hist(wpl[[3]], breaks = 50, 
      main = "Words per Line: Twitter", 
      xlab = "Words per Line", 
      col = "green")

Sampling, Building a Corpus, and Data Cleaning

Sample the Data

Before we can process the data, we need to take a sample of it. A relatively small sample still allows for good predictions and makes our processing and analysis much faster.

set.seed(666)
s <- 0.01 # sample 1% of each source

blogs_s <- sample(blogs, length(blogs) * s, replace = FALSE)
news_s <- sample(news, length(news) * s, replace = FALSE)
twitter_s <- sample(twitter, length(twitter) * s, replace = FALSE)
sample_data <- c(blogs_s, news_s, twitter_s)
rm(blogs, blogs_s, news, news_s, twitter, twitter_s)

Build a Corpus

Now that our data has been sampled, we will transform it into a corpus we can work with using the tm package in R.

corpus <- VCorpus(VectorSource(sample_data))

Data Cleaning

With the data in a corpus format, we can now begin cleaning and processing our data.
Here we will remove:

  • Common stop words (e.g. “the”, “a”, “an”, “in”)
  • Numbers
  • Punctuation
  • URLs
  • Twitter handles
  • Email address patterns
  • Profanity
  • Whitespace

To remove profanity, we will use a list of offensive language made available by the Carnegie Mellon School of Computer Science. We will also convert all characters to lower case.

# Define a content transformer that replaces a given pattern with a space
DeleteText <- content_transformer(function(x, patt) gsub(patt, " ", x))

# Remove URLs (match from the protocol up to the next whitespace)
corpus <- tm_map(corpus, DeleteText, "(f|ht)tps?://\\S+")

# Remove Twitter handles
corpus <- tm_map(corpus, DeleteText, "@[^\\s]+")

# Remove email address patterns
corpus <- tm_map(corpus, DeleteText, "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}")

# Convert all characters to lower case
corpus <- tm_map(corpus, content_transformer(tolower))

# Remove stop words
corpus <- tm_map(corpus, removeWords, stopwords(kind = "en"))

# Remove numbers
corpus <- tm_map(corpus, removeNumbers)

# Remove Punctuation
corpus <- tm_map(corpus, removePunctuation)

# Remove profanity
## Load in profanity data
con <- file("profanity.txt", open = "r")
profanity <- readLines(con = con, skipNul = T)
close(con)
## Remove offensive language
corpus <- tm_map(corpus, removeWords, profanity)

# Remove whitespace
corpus <- tm_map(corpus, stripWhitespace)

saveRDS(corpus, "english_corpus.rds")

Exploratory Data Analysis

To get a better understanding of our data, we will look at two main elements: word frequencies and N-gram frequencies.

We will use the RWeka package to create one-, two-, and three-grams and then produce bar plots of the 10 most common of each of these N-grams. Note that the frequency of a one-gram is equivalent to the frequency of an individual word.

N-Grams

First, we will need to create functions to tokenize our data into one-, two-, and three-grams.

one_gram <- function(x){
  NGramTokenizer(x, Weka_control(min = 1, max = 1))
} 
two_gram <- function(x){
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
} 
three_gram <- function(x){
  NGramTokenizer(x, Weka_control(min = 3, max = 3))
} 

One-Grams (Word Frequency)
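
A minimal sketch of how the one-gram frequencies and their bar plot might be produced with the tokenizer above (the object names tdm_1, freq_1, and top_1 are my own, not necessarily those used in the original analysis):

tdm_1 <- TermDocumentMatrix(corpus, control = list(tokenize = one_gram))
freq_1 <- sort(rowSums(as.matrix(tdm_1)), decreasing = TRUE)
top_1 <- data.frame(ngram = names(freq_1)[1:10], count = freq_1[1:10])

ggplot(top_1, aes(x = reorder(ngram, count), y = count)) +
  geom_col(fill = "blue") +
  coord_flip() +
  labs(title = "Top 10 One-Grams", x = "One-Gram", y = "Frequency")

The same pattern, with the two_gram and three_gram tokenizers swapped in, produces the two- and three-gram frequencies in the sections that follow.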

Two-Grams

Three-Grams

Moving Forward

The overall goal of this project is to create a text prediction algorithm that will be presented as an R Shiny application. The app is required to take a word or phrase as input and produce a prediction for the next word in the phrase as output.
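
As an illustration only, a minimal sketch of what such an interface could look like in Shiny is given below; the widget labels are placeholders, and the prediction routine outlined next would replace the placeholder output.

library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a word or phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$phrase)
    # The n-gram prediction routine described below would be called here
    "predicted word"  # placeholder output
  })
}

# shinyApp(ui = ui, server = server)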

The prediction algorithm will be based on the lessons learned from the exploratory analysis performed in this report, using n-gram modelling and word frequencies. The initial strategy will be to predict from the most common one-grams until a complete word followed by a space has been entered. Once a full word has been entered, the prediction will be based on the most common two-grams that begin with that word, and so on as more words are added.
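
A rough sketch of how such a frequency-based lookup might work is shown below. It assumes named frequency vectors two_gram_freq and three_gram_freq built from the N-gram counts above; the function and argument names are hypothetical, not part of the current code.

# Hypothetical lookup: given named frequency vectors of two- and three-grams,
# return the most frequent continuation of the input phrase.
predict_next_word <- function(phrase, two_gram_freq, three_gram_freq) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  if (n >= 2) {
    # Prefer three-grams that start with the last two words entered
    prefix <- paste(words[n - 1], words[n])
    hits <- three_gram_freq[startsWith(names(three_gram_freq), paste0(prefix, " "))]
    if (length(hits) > 0) {
      return(tail(strsplit(names(hits)[which.max(hits)], " ")[[1]], 1))
    }
  }
  # Back off to two-grams that start with the last word entered
  hits <- two_gram_freq[startsWith(names(two_gram_freq), paste0(words[n], " "))]
  if (length(hits) > 0) {
    return(tail(strsplit(names(hits)[which.max(hits)], " ")[[1]], 1))
  }
  NA_character_  # no match found
}

For example, predict_next_word("thanks for the", two_gram_freq, three_gram_freq) would look up the most frequent three-gram beginning with "for the" and return its final word.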

Adjustments to the algorithm will be made based on what best improves the accuracy and efficiency of the model.