In this milestone report a start is made on the Coursera Data Science Capstone project. The goal of the project is to build a text prediction model from a large amount of text spread over several documents. Natural Language Processing is the main new technique learned throughout this project.
Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. The cornerstone of this project is to understand and build a predictive text model for their smart keyboard.
First, the data provided by SwiftKey is downloaded from the course website. Three English .txt files are used: the Twitter, blogs and news data sets. Below, the data is read in and the required libraries are loaded.
Let's read in the data from all three documents: news, blogs and Twitter.
setwd("~/Coursera/Milestone Project/Coursera-SwiftKey/final/en_US")
twitter <- readLines("en_US.twitter.txt")
## Warning in readLines("en_US.twitter.txt"): line 167155 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 268547 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1274086 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1759032 appears to contain
## an embedded nul
news <- readLines("en_US.news.txt")
## Warning in readLines("en_US.news.txt"): incomplete final line found on
## 'en_US.news.txt'
blogs <- readLines("en_US.blogs.txt")
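The readLines() warnings above come from embedded nul characters in the Twitter file and a missing final newline in the news file; they are harmless here. As a minimal sketch (not the code used for the results in this report), a slightly more defensive read could silence them:
# Optional, more defensive read: skip embedded nuls and suppress the
# incomplete-final-line warning (not used for the output shown in this report).
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE, warn = FALSE)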
library(tm)
## Loading required package: NLP
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(RColorBrewer)
library(wordcloud)
A sample of the data is selected to serve as the training data.
# The full files are too large to work with efficiently,
# so a 1% random sample is taken from each dataset.
set.seed(1000)
twitter2 <- sample(twitter, length(twitter)*0.01)
news2 <- sample(news, length(news)*0.01)
blogs2 <- sample(blogs, length(blogs)*0.01)
#Write the samples to disk,
# so the large original objects can be removed from the global environment.
setwd("~/Coursera/Milestone Project/Coursera-SwiftKey/final/en_US/nieuw")
write(twitter2, file = "Twitter.txt")
write(news2, file="News.txt")
write(blogs2, file="Blogs.txt")
rm(twitter); rm(blogs); rm(news)
Now let's generate some summary statistics, including word frequencies, for all three documents (blogs, news and Twitter). The frequencies are generated using a corpus.
#Basic summaries of the 3 files
#Load the sampled texts into R as a corpus.
#A corpus is a collection of documents in the R environment.
#This involves loading the sampled files written above into a Corpus object.
#Create Corpus
setwd("~/Coursera/Milestone Project/Coursera-SwiftKey/final/en_US/nieuw")
docs <- Corpus(DirSource("~/Coursera/Milestone Project/Coursera-SwiftKey/final/en_US/nieuw"))
summary(docs)
## Length Class Mode
## Blogs.txt 2 PlainTextDocument list
## News.txt 2 PlainTextDocument list
## Twitter.txt 2 PlainTextDocument list
inspect(docs)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 2088532
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 161573
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 1627378
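As a quick numeric complement to the corpus inspection above, here is a minimal sketch of basic line and word counts for the 1% samples. It assumes twitter2, news2 and blogs2 are still in the environment and uses the stringi package, an extra dependency not loaded above:
#Basic line and word counts for the 1% samples (sketch; stringi is an extra package)
library(stringi)
data.frame(
  file  = c("Blogs", "News", "Twitter"),
  lines = c(length(blogs2), length(news2), length(twitter2)),
  words = c(sum(stri_count_words(blogs2)),
            sum(stri_count_words(news2)),
            sum(stri_count_words(twitter2)))
)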
## BLOGS DOCUMENT
#The command above shows there are 3 files in the corpus.
#Now let's build a Document Term Matrix:
#a matrix that shows the word frequencies.
#Start with the blogs document
dtmblog <- DocumentTermMatrix(docs[1])
freqblog <- colSums(as.matrix(dtmblog))
length(freqblog)
## [1] 48944
ordblog <- order(freqblog)
freqblog[head(ordblog)]
## '07), '56. '60 '60s! '90s 'america's
## 1 1 1 1 1 1
freqblog[tail(ordblog)]
## was with for that and the
## 2768 2841 3639 4446 10658 18362
#Now make a dataframe of the document matrix which will be used to plot the frequencies
dfblog <- data.frame(word=names(freqblog), freq=freqblog)
## NEWS DOCUMENT
#For the news and Twitter documents the same steps are used, only with different object names.
#Not every intermediate step is shown, to keep the report concise.
dtmnews <- DocumentTermMatrix(docs[2])
freqnews <- colSums(as.matrix(dtmnews))
length(freqnews)
## [1] 9035
ordnews <- order(freqnews)
freqnews[head(ordnews)]
## '60s 'cue 'it 'medical 'oh, 'thank
## 1 1 1 1 1 1
freqnews[tail(ordnews)]
## was with that for and the
## 192 235 246 247 674 1583
#Now make a dataframe of the document matrix for the plot of the frequencies
dfnews <- data.frame(word=names(freqnews), freq=freqnews)
## TWITTER DOCUMENT
dtmtwit <- DocumentTermMatrix(docs[3])
freqtwit <- colSums(as.matrix(dtmtwit))
length(freqtwit)
## [1] 44831
ordtwit <- order(freqtwit)
freqtwit[head(ordtwit)]
## \035amazop[;;; ''would '00, '08 '08?
## 1 1 1 1 1
## '09
## 1
freqtwit[tail(ordtwit)]
## your that for and you the
## 1719 2162 3826 4398 4705 9135
#Now make a dataframe of the document matrix for the frequency plot
dftwit <- data.frame(word=names(freqtwit), freq=freqtwit)
Basic histograms are shown below to visualize the word frequencies.
#Plot the word frequencies of the words that appear more than 2500 times in a histogram.
#Cutoffs of 2500, 100 and 1500 are chosen because many stopwords are included.
#Removing these stopwords would give a cleaner visualization,
#but the prediction model works better when the stopwords are kept
#(see the stopword-removal sketch after the three plots).
#PLOT BLOGS FREQUENCIES
p <- ggplot(subset(dfblog, freq>2500), aes(word, freq))
p <- p + geom_bar(stat="identity")+ labs(title="Blog document frequencies of most common words")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
#PLOT NEWS FREQUENCIES
q <- ggplot(subset(dfnews, freq>100), aes(word, freq))
q <- q + geom_bar(stat="identity")+ labs(title="News document frequencies of most common words")
q <- q + theme(axis.text.x=element_text(angle=45, hjust=1))
q
#PLOT TWITTER FREQUENCIES
r <- ggplot(subset(dftwit, freq>1500), aes(word, freq))
r <- r + geom_bar(stat="identity")+ labs(title="Twitter document frequencies of most common words")
r <- r + theme(axis.text.x=element_text(angle=45, hjust=1))
r
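As noted in the comments above, common stopwords dominate these histograms. Below is a minimal sketch of how they could be stripped from the corpus before building a Document Term Matrix; it is deliberately not applied in this report, because the prediction model should keep the stopwords:
#Sketch only: remove English stopwords before building a Document Term Matrix.
#Not applied above, since the prediction model needs the stopwords.
docs_nostop <- tm_map(docs, content_transformer(tolower))
docs_nostop <- tm_map(docs_nostop, removeWords, stopwords("english"))
dtmblog_nostop <- DocumentTermMatrix(docs_nostop[1])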
Another nice way to show the word frequencies is with word clouds, in which the most frequently used words appear larger.
#Make a word cloud for the blogs document
#Plot the 100 most frequently used words in color
set.seed(1000)
dark2 <- brewer.pal(6, "Dark2")
wordcloud(names(freqblog), freqblog, max.words=100, rot.per=0.2, colors=dark2)
#Make a word cloud for the news document
#Plot the 100 most frequently used words in color
set.seed(1000)
dark2 <- brewer.pal(6, "Dark2")
wordcloud(names(freqnews), freqnews, max.words=100, rot.per=0.2, colors=dark2)
#Make a word cloud for the Twitter document
#Plot the 100 most frequently used words in color
set.seed(1000)
dark2 <- brewer.pal(6, "Dark2")
wordcloud(names(freqtwit), freqtwit, max.words=100, rot.per=0.2, colors=dark2)
There are plans for creating a prediction algorithm and a Shiny app. The plan is to use an n-gram model for 2-word and 3-word sequences. Before the n-gram model can be used, the data has to be preprocessed to remove numbers, capitalization and punctuation.
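A rough sketch of that preprocessing step using tm is shown below; the object name docs_clean is a placeholder and the exact pipeline may still change:
#Planned preprocessing (sketch): lower-case, remove numbers and punctuation, trim whitespace
docs_clean <- tm_map(docs, content_transformer(tolower))
docs_clean <- tm_map(docs_clean, removeNumbers)
docs_clean <- tm_map(docs_clean, removePunctuation)
docs_clean <- tm_map(docs_clean, stripWhitespace)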
For the further analysis the ngram(x, n) function will be used; this n-gram model is still in progress.
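As a minimal sketch of that work in progress, the ngram package's ngram(x, n) function could be applied to the cleaned blog sample like this (docs_clean comes from the preprocessing sketch above; object names are placeholders):
#Sketch: build bigram and trigram frequency tables with the ngram package
library(ngram)
blog_text <- paste(content(docs_clean[[1]]), collapse = " ")
bigrams  <- get.phrasetable(ngram(blog_text, n = 2))
trigrams <- get.phrasetable(ngram(blog_text, n = 3))
head(bigrams)  #most frequent two-word sequences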
The Shiny app will consist of a text input box where the user can enter one word. The prediction algorithm will then predict the next word that is most likely to appear.
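A bare-bones sketch of that app structure is shown below; predict_next_word() is a placeholder for the n-gram lookup that still has to be written:
#Skeleton of the planned Shiny app (sketch only)
library(shiny)

predict_next_word <- function(word) {
  #placeholder: to be replaced by the real bigram/trigram lookup
  "..."
}

ui <- fluidPage(
  titlePanel("Next word prediction"),
  textInput("word", "Type one word:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    predict_next_word(input$word)
  })
}

shinyApp(ui = ui, server = server)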