The project consists of building a Shiny application capable of predicting the next word a user will enter, based on the words entered previously. In this report I demonstrate how to load the data, then walk through an exploratory analysis and cleaning workflow that produces a data set suitable for developing and training a prediction algorithm.
The objective of this project is to develop an application that predicts the next word a person will type based on the words typed previously. The app relies on a prediction algorithm trained with data obtained from blogs, news websites and Twitter. While the ultimate goal of the project is to develop an online application using Shiny, this report is limited to loading the data, conducting an exploratory analysis and cleaning the data in preparation for developing a prediction model.
The training data set contains text obtained from blogs, news sites and Twitter in English, German, Russian and Finnish. In this project we will consider only the English data set.
We begin by loading the libraries used throughout the analysis:
library(RColorBrewer)
library(rJava)
library(RWeka)
library(NLP)
library(tm)
library(stringi)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(wordcloud)
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidytext)
library(stopwords)
##
## Attaching package: 'stopwords'
## The following object is masked from 'package:tm':
##
## stopwords
options(mc.cores=1)
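The three English corpus files are read from disk below. For completeness, here is a minimal sketch of how they could be fetched first, assuming the corpus is distributed as a zip archive; the URL below is only a placeholder for wherever the data is actually hosted.
# Hypothetical download step: 'data_url' is a placeholder, not the real location of the corpus
data_url <- "https://example.com/corpus.zip"
if (!all(file.exists(c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")))) {
  download.file(data_url, destfile = "corpus.zip", mode = "wb")
  unzip("corpus.zip", junkpaths = TRUE)   # flatten any folder structure inside the archive
}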
# Read the full English corpora line by line
news <- read_lines('en_US.news.txt')
blog <- read_lines('en_US.blogs.txt')
twitter <- read_lines('en_US.twitter.txt')
We’ll check how many lines and characters each data set has and how much memory each one requires.
stri_stats_general(blog)
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 206824382 170389539
object.size(blog)
## 267758632 bytes
stri_stats_general(news)
## Lines LinesNEmpty Chars CharsNWhite
## 1010242 1010242 203223154 169860866
object.size(news)
## 269840992 bytes
stri_stats_general(twitter)
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162096031 134082634
object.size(twitter)
## 334484736 bytes
Our text files contain millions of lines. As a result, we’ll take a sample of each file and use those samples for a representative statistical analysis.
Here we’ll create the samples by selecting lines at random, keeping roughly 1% of each file. To save memory, we’ll remove the full data sets once the samples have been drawn.
set.seed(500)
blogsamp <- blog[rbinom(length(blog),1,.01)==1]
newssamp <- news[rbinom(length(news),1,.01)==1]
twitsamp <- twitter[rbinom(length(twitter),1,.01)==1]
rm(list=c('blog','news','twitter'))
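Since rbinom keeps each line independently with probability 0.01, the samples contain roughly (not exactly) 1% of the lines. A quick check of how many lines each sample retained:
# Number of lines kept in each ~1% sample
c(blog = length(blogsamp), news = length(newssamp), twitter = length(twitsamp))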
head(blogsamp,3)
## [1] "Due to the nature of my job (raising children) I find my cleaning standards have slipped somewhat. After a day spent washing up, sweeping food off the kitchen floor, wiping all the surfaces, picking up toys and random objects up off every floor in the house, cooking, looking after my kids, making beds, and hanging washing out, seventeen times over, I have no remaining energy to do any actual cleaning."
## [2] "NIGHT OF THE DRIBBLER"
## [3] "A QUESTION TO ANSWER. Why Isn’t There More Healing? There are many hindrances to receiving healing. We look at just a couple of them."
head(newssamp,3)
## [1] "If Mr. Buffett has a guilty conscience, he can increase donations to charity or simply pay more in taxes. But suggesting we raise the taxes on the wealthy will lead to more problems than it solves."
## [2] "Minnesota at Detroit, 1 p.m."
## [3] "Shakespeare's tragic tale of star-crossed lovers is the only play in Great Lakes history to be staged by all six of the company's permanent artistic directors, beginning with Arthur Lithgow's production in 1963 to Charles Fee's iteration, set in a post-World War I, totalitarian state, currently on stage at the Hanna Theatre."
head(twitsamp,3)
## [1] "Doing some #cufon testing and implementation in #wordpress. What the FOUC d'y'all think of cufon?"
## [2] "YourKoz Games+Find out who this is-$2.00+She concentrates on the calibre and quality of education for Jordanian children & global education"
## [3] "Wearing my 2011 shirt to the #Skrillex show in Indy tonight. Already got a lot of compliments."
To process the data into a form suitable for analysis, we’ll use tidy principles and the tidytext library. We’ll combine the three samples into a single data frame, then filter out Twitter handles, hashtags, email addresses, numbers, and punctuation, keeping only alphanumeric characters, apostrophes, and spaces. Finally, we’ll tokenize the cleaned text, which converts everything to lowercase, strips extra whitespace, and splits the strings into individual words and n-grams.
#combine data into a data frame of text and source:
alldata <- full_join(data.frame(text=blogsamp,source="blog"),data.frame(text=newssamp,source='news'))
## Joining, by = c("text", "source")
alldata <- alldata %>% full_join(data.frame(text=twitsamp,source='twitter'))
## Joining, by = c("text", "source")
# Remove emails, twitter handles, and hashtags, and replace punctuation characters with spaces
fdata <- alldata %>% mutate(text=gsub('[@][a-zA-Z0-9_]{1,15}','',text)) %>% #twitter handles
mutate(text = gsub('#\\b[A-Za-z0-9._-]*\\b','',text)) %>% #hashtags
mutate(text=gsub('\\b[A-Za-z0-9._-]*[@](.*?)[.].{1,3}\\b','',text)) %>% #email addresses
mutate(text=gsub('<U\\+[0-9A-Fa-f]{4,5}>','',text)) %>% #emoji escape codes such as <U+1F600>
mutate(text=gsub('[^0-9A-Za-z\' ]','',text)) %>% #anything not alphanumeric or ' or spaces
mutate(text=removeNumbers(text)) %>% #remove numbers
mutate(doc_id = row_number()) #keep unique document identifiers
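As a quick sanity check, here is a minimal sketch of how the handle, hashtag, and punctuation filters above behave, applied to a made-up example string rather than a line from the corpus:
# Made-up example string to illustrate the filters defined above
example <- "Loving the new release! Thanks @rstudio and everyone tweeting #rstats :-)"
example <- gsub('[@][a-zA-Z0-9_]{1,15}', '', example)   # removes "@rstudio"
example <- gsub('#\\b[A-Za-z0-9._-]*\\b', '', example)  # removes "#rstats"
example <- gsub('[^0-9A-Za-z\' ]', '', example)         # removes "!", ":", "-", ")"
example
The result keeps only the plain words (plus some stray spaces, which unnest_tokens ignores when tokenizing).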
#now we'll split our data set into three tidy tables of unigrams, bigrams, and trigrams:
tidywords <- fdata %>% unnest_tokens(word,text)
tidybis <- fdata %>% unnest_tokens(bigram,text,token='ngrams',n=2)
tidytris <- fdata %>% unnest_tokens(trigram,text,token='ngrams',n=3)
At last, we’re ready to do some analysis, starting with the frequencies of individual words.
wordfreq <- tidywords %>% count(word,sort=TRUE)
numwords <- dim(wordfreq)[1]
g <- ggplot(head(filter(wordfreq,!word %in% get_stopwords()$word),25),
aes(x = reorder(word,n), y = n)) +
geom_bar(stat = "identity", fill="darkseagreen2", colour="black") +
xlab("Word") + ylab("Count") +
ggtitle("Top 25 Unigrams by Word Frequency (excl. Stop Words)") +
theme(plot.title = element_text(lineheight=.8, face="bold")) + coord_flip()
print(g)
The 25 most popular words (excluding stop words) are presented in the bar chart above.
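The wordcloud package loaded earlier offers another view of the same frequencies. A minimal sketch, reusing the wordfreq table (the max.words and colour settings are arbitrary choices, not part of the analysis above):
# Word cloud of the most frequent non-stop-words
topwords <- filter(wordfreq, !word %in% get_stopwords()$word)
wordcloud(words = topwords$word, freq = topwords$n, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))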
Let’s take a look at a histogram of word frequencies. The raw frequencies are heavily right-skewed (a handful of words appear very often while most appear only a few times), so we plot the log of the frequencies to make the shape of the distribution visible.
hist(log(wordfreq$n))
Let’s look at dictionary coverage. First we’ll calculate, for each dictionary size n, what proportion of all word occurrences in the corpus is covered by the n most frequent words.
nwords <- sum(wordfreq$n)   # total number of word tokens in the sample
# cumulative proportion of all tokens covered by the n most frequent words
wordfreq <- mutate(wordfreq, coverage = cumsum(n) / nwords)
Now let’s plot it:
seventyfive <- sum(wordfreq$coverage < .75) + 1
ninety <- sum(wordfreq$coverage < .9) + 1
g2 <- qplot(y = coverage, data = wordfreq) +
  geom_vline(xintercept = seventyfive) +
  geom_text(aes(x = seventyfive, y = .4,
                label = paste('75% coverage @ ', seventyfive, ' words')),
            angle = 90, vjust = 1) +
  geom_vline(xintercept = ninety) +
  geom_text(aes(x = ninety, y = .4,
                label = paste('90% coverage @ ', ninety, ' words')),
            angle = 90, vjust = 1) +
  xlab("# Words") + ylab("% Coverage") +
  ggtitle("Dictionary Coverage Chart")
print(g2)
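The bigram and trigram tables aren’t charted in this report, but they are what the prediction model will draw on. The unigram counting pattern above extends to them directly; a minimal sketch (the bifreq and trifreq names are just illustrative):
# Frequency tables for the higher-order n-grams, analogous to wordfreq above
bifreq <- tidybis %>% count(bigram, sort = TRUE)
trifreq <- tidytris %>% count(trigram, sort = TRUE)
head(bifreq, 5)   # the most common word pairs in the sample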