This is the milestone report for Week 2 of the JHU Coursera Data Science Certificate Capstone Project. This document uses my exploratory data analysis to describe the major features of the data in question and outlines my plan for the capstone’s prediction algorithm and Shiny application.
The data used in this analysis was provided by SwiftKey and comes from three separate sources:

- Blog posts
- News articles
- Twitter feeds
While versions of these sources were provided in English, Russian, German, and Finnish, we will focus solely on the English sources.
Here we load the libraries that we will use to read, process, and explore the data.
suppressMessages(suppressWarnings(library(knitr)))
suppressMessages(suppressWarnings(library(stringi)))
suppressMessages(suppressWarnings(library(ggplot2)))
suppressMessages(suppressWarnings(library(tm)))
suppressMessages(suppressWarnings(library(RWeka)))
Now we will load the data we will be exploring into R. We will skip showing the download step, but the data used is available here.
file_blogs <- "en_US/en_US.blogs.txt"
file_news <- "en_US/en_US.news.txt"
file_twitter <- "en_US/en_US.twitter.txt"
cons <- c(file_blogs, file_news, file_twitter)
for (i in seq_along(cons)) {
  con <- file(cons[i], open = "r")
  # Strip the "en_US/en_US." prefix and ".txt" suffix from each path so the
  # text is assigned to objects named blogs, news, and twitter
  assign(substr(cons[i], 13, nchar(cons[i]) - 4),
         readLines(con = con, encoding = "UTF-8", skipNul = TRUE))
  close(con = con)
}
rm(con, i)
We will now provide some basic information on the data files we will be working with, including the size of each file in megabytes, the total number of lines, the total number of words, and summary statistics on the number of words per line in each file.
file_sizes <- round(file.info(cons)$size / 1024^2)
total_lines <- sapply(list(blogs, news, twitter), length)
total_words <- sapply(list(blogs, news, twitter), stri_stats_latex)[4, ]
wpl <- lapply(list(blogs, news, twitter), stri_count_words)
wpl_summ <- sapply(wpl, function(x) summary(x)[c('Min.', 'Mean', 'Max.')])
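The summary table below was rendered with knitr; the exact rendering code is not shown above, but a minimal sketch, assuming the objects computed in the chunk above, might look like this:

# Sketch: assemble the summary statistics into a data frame and render it as a
# markdown table with knitr::kable (object names assume the chunk above)
summary_df <- data.frame(File = cons,
                         `File Size` = paste(file_sizes, "MB"),
                         `Total Lines` = total_lines,
                         `Total Words` = total_words,
                         `Minimum Words per Line` = wpl_summ['Min.', ],
                         `Mean Words per Line` = round(wpl_summ['Mean', ], 2),
                         `Maximum Words per Line` = wpl_summ['Max.', ],
                         check.names = FALSE)
kable(summary_df)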
| File | File Size | Total Lines | Total Words | Minimum Words per Line | Mean Words per Line | Maximum Words per Line |
|---|---|---|---|---|---|---|
| en_US/en_US.blogs.txt | 200 MB | 899288 | 37570839 | 0 | 41.75 | 6726 |
| en_US/en_US.news.txt | 196 MB | 1010242 | 34494539 | 1 | 34.41 | 1796 |
| en_US/en_US.twitter.txt | 159 MB | 2360148 | 30451170 | 1 | 12.75 | 47 |
par(mfrow = c(3, 1))
hist(wpl[[1]], breaks = 50,
     main = "Words per Line: Blogs",
     xlab = "Words per Line",
     col = "blue")
hist(wpl[[2]], breaks = 50,
     main = "Words per Line: News",
     xlab = "Words per Line",
     col = "red")
hist(wpl[[3]], breaks = 50,
     main = "Words per Line: Twitter",
     xlab = "Words per Line",
     col = "green")
Before we can process the data, we need to take a sample of it. We can still make good predictions with a relatively small sample of the data, and sampling will make our processing and analysis much easier.
set.seed(666)
s <- 0.01 # sampling fraction: keep 1% of each source
blogs_s <- sample(blogs, length(blogs) * s, replace = FALSE)
news_s <- sample(news, length(news) * s, replace = FALSE)
twitter_s <- sample(twitter, length(twitter) * s, replace = FALSE)
sample_data <- c(blogs_s, news_s, twitter_s)
rm(blogs, blogs_s, news, news_s, twitter, twitter_s)
Now that our data has been sampled, we will transform it into a corpus we can work with using the tm package in R.
corpus <- VCorpus(VectorSource(sample_data))
With the data in a corpus format, we can now begin cleaning and processing our data.
Here we will remove:

- URLs
- Twitter handles
- Email addresses
- English stop words
- Numbers
- Punctuation
- Profanity
- Extra whitespace

To remove profanity, we will use a list of offensive language made available by the Carnegie Mellon School of Computer Science. We will also convert all characters to lowercase.
# Define a function to turn a specific pattern into a blank string
DeleteText <- content_transformer(function(x, patt) gsub(patt, " ", x))
# Remove URLs
corpus <- tm_map(corpus, DeleteText, "(f|ht)tp(s?)://(.*)[.][a-z]+")
# Remove Twitter handles
corpus <- tm_map(corpus, DeleteText, "@[^\\s]+")
# Remove email address patterns
corpus <- tm_map(corpus, DeleteText, "\\b[A-Z a-z 0-9._ - ]*[@](.*?)[.]{1,3} \\b")
# Convert all characters to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove stop words
corpus <- tm_map(corpus, removeWords, stopwords(kind = "en"))
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
# Remove Punctuation
corpus <- tm_map(corpus, removePunctuation)
# Remove profanity
## Load in profanity data
con <- file("profanity.txt", open = "r")
profanity <- readLines(con = con, skipNul = T)
close(con)
## Remove offensive language
corpus <- tm_map(corpus, removeWords, profanity)
# Remove whitespace
corpus <- tm_map(corpus, stripWhitespace)
saveRDS(corpus, "english_corpus.rds")
To get a better understanding of our data, we will look at two main elements of the data: the frequencies of individual words and the frequencies of common multi-word combinations (N-grams).

We will use the RWeka package to create one-, two-, and three-grams and then produce bar plots of the 10 most common of each of these N-grams. Note that the frequency of a one-gram is equivalent to the frequency of an individual word.
First, we will need to create functions to tokenize our data into one, two, and three grams.
one_gram <- function(x) {
  NGramTokenizer(x, Weka_control(min = 1, max = 1))
}
two_gram <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
three_gram <- function(x) {
  NGramTokenizer(x, Weka_control(min = 3, max = 3))
}
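With the tokenizers defined, the N-gram frequencies can be computed from the corpus and plotted. The exact plotting code is not shown here, but a minimal sketch for the one-grams, assuming the corpus and one_gram() function above, could look like the following (the same pattern applies to two_gram() and three_gram()):

# Sketch: compute one-gram frequencies from the corpus and plot the 10 most
# common terms with ggplot2 (assumes 'corpus' and 'one_gram' defined above).
# Note: as.matrix() can be memory-hungry for larger samples.
tdm_one <- TermDocumentMatrix(corpus, control = list(tokenize = one_gram))
freq_one <- sort(rowSums(as.matrix(tdm_one)), decreasing = TRUE)
top_one <- data.frame(term = names(freq_one)[1:10],
                      count = freq_one[1:10])

ggplot(top_one, aes(x = reorder(term, count), y = count)) +
  geom_col(fill = "blue") +
  coord_flip() +
  labs(title = "10 Most Common One-Grams",
       x = "One-Gram", y = "Frequency")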
The overall goal of this project is to create a text prediction algorithm that will be presented as an R Shiny application. The app is required to take a word or phrase as input and produce a prediction for the next word in the phrase as output.
The prediction algorithm will build on the lessons learned from the exploratory analysis performed in this report, using N-gram models and word frequencies. Our initial strategy will be to predict from the most common one-grams until a complete word has been entered, followed by a space. Once a full word has been entered, the prediction will be based on the most common two-grams that begin with that word, and so on as more text and words are added.
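As a rough illustration of this strategy, the sketch below shows how a next-word lookup might work against hypothetical frequency tables (the names bigram_freq, unigram_freq, first_word, second_word, word, and count are placeholders for illustration, not objects created in this report):

# Sketch of a next-word lookup based on N-gram frequencies.
# 'bigram_freq' is a hypothetical data frame with columns first_word, second_word, count;
# 'unigram_freq' is a hypothetical data frame with columns word, count (sorted by count).
predict_next_word <- function(phrase, bigram_freq, unigram_freq, n = 3) {
  words <- unlist(strsplit(tolower(trimws(phrase)), "\\s+"))
  if (length(words) == 0) {
    # Nothing typed yet: fall back to the most common individual words
    return(head(unigram_freq$word, n))
  }
  last_word <- tail(words, 1)
  # Find the two-grams that start with the last word typed, most frequent first
  matches <- bigram_freq[bigram_freq$first_word == last_word, ]
  matches <- matches[order(-matches$count), ]
  if (nrow(matches) == 0) {
    # No matching two-gram: back off to the most common one-grams
    return(head(unigram_freq$word, n))
  }
  head(matches$second_word, n)
}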
Adjustments to the algorithm will be made based on what best increases accuracy and efficiency of the model.
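Finally, a bare-bones outline of how the Shiny application might wrap such a prediction function is sketched below; the layout and the predict_next_word() call are placeholders for illustration, not the final app:

library(shiny)

# Minimal sketch of the planned app: a text input and a reactive prediction
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a word or phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    # predict_next_word() is the hypothetical lookup sketched above
    paste(predict_next_word(input$phrase, bigram_freq, unigram_freq),
          collapse = ", ")
  })
}

shinyApp(ui = ui, server = server)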