Executive Summary

The project consists of building a Shiny application capable of predicting the next word the user will enter, based on the words entered previously. In this report I demonstrate how to load the data, then walk through an exploratory analysis and cleaning workflow, ending up with a data set that can be used to develop and train a prediction algorithm.

The Capstone Project

The objective of this project is to develop an application that predicts the next word a person will type based on the words typed previously. The app relies on a prediction algorithm trained with data obtained from blogs, news websites and Twitter. While the ultimate goal of the project is to develop an online application using Shiny, this report is limited to loading the data, conducting an exploratory analysis and cleaning the data, in preparation for developing a prediction model.

The Data

The training data set contains text obtained from blogs, news sites and Twitter in English, German, Russian and Finnish. In this project we will only consider the English data set.

Libraries

The libraries used in this analysis are loaded below:

library(RColorBrewer)
library(rJava)
library(RWeka)
library(NLP)
library(tm)
library(stringi)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(wordcloud)
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidytext)
library(stopwords)
## 
## Attaching package: 'stopwords'
## The following object is masked from 'package:tm':
## 
##     stopwords
options(mc.cores=1)

Load the Data

news <- read_lines('en_US.news.txt')
blog <- read_lines('en_US.blogs.txt')
twitter <- read_lines('en_US.twitter.txt')

Examine the Data Sets

We’ll check how many lines and characters each data set has and how much memory each data set requires.

stri_stats_general(blog)
##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   206824382   170389539
object.size(blog)
## 267758632 bytes
stri_stats_general(news)
##       Lines LinesNEmpty       Chars CharsNWhite 
##     1010242     1010242   203223154   169860866
object.size(news)
## 269840992 bytes
stri_stats_general(twitter)
##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162096031   134082634
object.size(twitter)
## 334484736 bytes

The three files contain more than four million lines combined. To keep the analysis tractable, we’ll take a random sample of each file and use it to conduct representative statistical analysis.

Create Samples

Here we’ll create samples based on a random selection of lines, keeping roughly 1% of each data set. To save memory, we’ll remove the full data sets once the samples are drawn.

set.seed(500)
# keep each line independently with probability 0.01 (~1% of each data set)
blogsamp <- blog[rbinom(length(blog),1,.01)==1]
newssamp <- news[rbinom(length(news),1,.01)==1]
twitsamp <- twitter[rbinom(length(twitter),1,.01)==1]
rm(list=c('blog','news','twitter'))
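
As a quick sanity check (not part of the original workflow), we can confirm how many lines each sample retains; with a 1% selection probability, each count should be close to 1% of the line counts reported above.

# number of lines retained in each sample
sapply(list(blog = blogsamp, news = newssamp, twitter = twitsamp), length)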

View Samples

head(blogsamp,3)
## [1] "Due to the nature of my job (raising children) I find my cleaning standards have slipped somewhat. After a day spent washing up, sweeping food off the kitchen floor, wiping all the surfaces, picking up toys and random objects up off every floor in the house, cooking, looking after my kids, making beds, and hanging washing out, seventeen times over, I have no remaining energy to do any actual cleaning."
## [2] "NIGHT OF THE DRIBBLER"                                                                                                                                                                                                                                                                                                                                                                                               
## [3] "A QUESTION TO ANSWER. Why Isn’t There More Healing? There are many hindrances to receiving healing. We look at just a couple of them."
head(newssamp,3)
## [1] "If Mr. Buffett has a guilty conscience, he can increase donations to charity or simply pay more in taxes. But suggesting we raise the taxes on the wealthy will lead to more problems than it solves."                                                                                                                                 
## [2] "Minnesota at Detroit, 1 p.m."                                                                                                                                                                                                                                                                                                          
## [3] "Shakespeare's tragic tale of star-crossed lovers is the only play in Great Lakes history to be staged by all six of the company's permanent artistic directors, beginning with Arthur Lithgow's production in 1963 to Charles Fee's iteration, set in a post-World War I, totalitarian state, currently on stage at the Hanna Theatre."
head(twitsamp,3)
## [1] "Doing some #cufon testing and implementation in #wordpress. What the FOUC d'y'all think of cufon?"                                          
## [2] "YourKoz Games+Find out who this is-$2.00+She concentrates on the calibre and quality of education for Jordanian children & global education"
## [3] "Wearing my 2011 shirt to the #Skrillex show in Indy tonight. Already got a lot of compliments."

Preprocess Data

To process the data into a form suitable for analysis, we’ll use tidy principles and the tidytext library. We’ll filter Twitter handles, hashtags, email addresses, and punctuation from the data, then combine our three samples into a single data frame. Finally, we’ll remove numbers and any characters other than letters, apostrophes and spaces, convert everything to lowercase, and split the text into individual words, bigrams, and trigrams.

#combine data into data fram of text and source:
alldata <- full_join(data.frame(text=blogsamp,source="blog"),data.frame(text=newssamp,source='news'))
## Joining, by = c("text", "source")
alldata <- alldata %>% full_join(data.frame(text=twitsamp,source='twitter'))
## Joining, by = c("text", "source")
# Remove emails, twitter handles, and hashtags, and replace punctuation characters with spaces

fdata <- alldata %>% mutate(text=gsub('[@][a-zA-Z0-9_]{1,15}','',text)) %>% #twitter handles
  mutate(text = gsub('#\\b[A-Za-z0-9._-]*\\b','',text)) %>% #hashtags
  mutate(text=gsub('\\b[A-Za-z0-9._-]*[@](.*?)[.].{1,3}\\b','',text)) %>% #email addresses
  mutate(text=gsub('<U\\+.{4}>','',text)) %>% #emoji escape markers (the '+' must be escaped in the regex)
  mutate(text=gsub('[^0-9A-Za-z\' ]','',text)) %>% #anything not alphanumeric or ' or spaces
  mutate(text=removeNumbers(text)) %>% #remove numbers
  mutate(doc_id = row_number()) #keep unique document identifiers

#now we'll split our data set into three tidy tables of unigrams, bigrams, and trigrams:
tidywords <- fdata %>% unnest_tokens(word,text)
tidybis <- fdata %>% unnest_tokens(bigram,text,token='ngrams',n=2)
tidytris <- fdata %>% unnest_tokens(trigram,text,token='ngrams',n=3)
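
To foreshadow how these n-gram tables will feed the prediction algorithm, below is a minimal sketch of a bigram-based next-word lookup. This is an illustration only, not the final model; bigram_counts and the predict_next helper are hypothetical names introduced for this example.

library(tidyr)

# split each bigram into its two component words and count how often each pair occurs
bigram_counts <- tidybis %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE)

# hypothetical helper: return the most frequent words observed after 'prev_word'
predict_next <- function(prev_word, top_n = 3) {
  bigram_counts %>%
    filter(word1 == tolower(prev_word)) %>%
    slice_head(n = top_n) %>%
    pull(word2)
}

predict_next("thanks")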

Preliminary Analysis

At last, we’re ready to do some analysis. We’ll start by looking at the frequencies of individual words.

wordfreq <-  tidywords %>% count(word,sort=TRUE)
numwords <- dim(wordfreq)[1]

g <-   ggplot(head(filter(wordfreq,!word %in% get_stopwords()$word),25), 
              aes(x = reorder(word,n), y = n)) +
        geom_bar(stat = "identity", fill="darkseagreen2", colour="black") +
        xlab("Word") + ylab("Count") + 
        ggtitle("Top 25 Unigrams by Word Frequency (excl. Stop Words)") +
        theme(plot.title = element_text(lineheight=.8, face="bold")) + coord_flip()

print(g)

The 25 most frequent words (excluding very commonly used stop words) are presented in the bar chart above.

Histogram

Let’s take a look at a histogram of word frequencies. We’ve taken the log of the frequencies because the raw counts are heavily right-skewed: a handful of words occur very often while most occur only a few times.

hist(log(wordfreq$n), main = "Histogram of Log Word Frequencies", xlab = "log(word frequency)")

Dictionary Coverage

Let’s look at the dictionary coverage. First we’ll calculate, for each dictionary size, what proportion of all word occurrences in the corpus is covered by that many of the most frequent words.

coverage = rep(0,numwords)
nwords = sum(wordfreq$n)

# coverage[i]: proportion of all word occurrences covered by the i most frequent words
for (i in 1:numwords){
  coverage[i] = sum(wordfreq$n[1:i])/nwords
}
wordfreq <- mutate(wordfreq,coverage=coverage)
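
As an aside, the same cumulative coverage can be computed without the explicit loop; a vectorized equivalent (assuming wordfreq is still sorted by descending n) is:

# cumulative share of all word occurrences covered by the most frequent words
wordfreq <- wordfreq %>% mutate(coverage = cumsum(n) / sum(n))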

Now let’s plot it:

seventyfive <- sum(wordfreq$coverage < .75) + 1
ninety <- sum(wordfreq$coverage < .9) + 1
g2 <- qplot(y = coverage, data = wordfreq) +
        geom_vline(xintercept = seventyfive) +
        geom_text(aes(x = seventyfive, y = .4,
                      label = paste('75% coverage @ ', seventyfive, ' words')),
                  angle = 90, vjust = 1) +
        geom_vline(xintercept = ninety) +
        geom_text(aes(x = ninety, y = .4,
                      label = paste('90% coverage @ ', ninety, ' words')),
                  angle = 90, vjust = 1) +
        xlab("# Words") + ylab("% Coverage") +
        ggtitle("Dictionary Coverage Chart")
print(g2)