In this milestone report a start is made on the Coursera Data Science Capstone project. The goal of the project is to build a text prediction model from a large amount of text spread over several documents. Natural Language Processing is the main new technique learned throughout this project.
Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. The cornerstone of this project is to understand and build a predictive text model for their smart keyboard.
First, the data provided by SwiftKey is downloaded from the course website. Three English .txt files are used: the Twitter, blogs and news data sets. Below, the data is read in and the required libraries are loaded.
Let's read in the data from all three documents: news, blogs and Twitter.
setwd("~/Coursera/Milestone Project/Coursera-SwiftKey/final/en_US")
twitter <- readLines("en_US.twitter.txt")
## Warning in readLines("en_US.twitter.txt"): line 167155 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 268547 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1274086 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1759032 appears to contain
## an embedded nul
news <- readLines("en_US.news.txt")
## Warning in readLines("en_US.news.txt"): incomplete final line found on
## 'en_US.news.txt'
blogs <- readLines("en_US.blogs.txt")
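The readLines() warnings above come from embedded nul characters in the Twitter file and a missing final newline in the news file; they are harmless here. As a minimal sketch (not the code used for the results in this report), a slightly more defensive read could silence them:
# Optional, more defensive read: skip embedded nuls and suppress the
# incomplete-final-line warning (not used for the output shown in this report).
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE, warn = FALSE)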
library(tm)
## Loading required package: NLP
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(RColorBrewer)
library(wordcloud)
A sample of the data is selected to serve as the training data.
# The full files are too large to work with efficiently,
# so a 1% random sample is taken from each dataset.
set.seed(1000)
twitter2 <- sample(twitter, length(twitter)*0.01)
news2 <- sample(news, length(news)*0.01)
blogs2 <- sample(blogs, length(blogs)*0.01)
#Write the samples to disk,
# so the large original objects can be removed from the global environment.
setwd("~/Coursera/Milestone Project/Coursera-SwiftKey/final/en_US/nieuw")
write(twitter2, file = "Twitter.txt")
write(news2, file="News.txt")
write(blogs2, file="Blogs.txt")
rm(twitter); rm(blogs); rm(news)
Now let's generate some summary statistics, including word frequencies, for all three documents (blogs, news and Twitter). The frequencies are generated using a corpus.
#Basic summaries of the 3 files
#Load the sampled texts into R as a corpus.
#A corpus is a collection of documents in the R environment.
#This involves loading the sampled files written above into a Corpus object.
#Create Corpus
setwd("~/Coursera/Milestone Project/Coursera-SwiftKey/final/en_US/nieuw")
docs <- Corpus(DirSource("~/Coursera/Milestone Project/Coursera-SwiftKey/final/en_US/nieuw"))
summary(docs)
## Length Class Mode
## Blogs.txt 2 PlainTextDocument list
## News.txt 2 PlainTextDocument list
## Twitter.txt 2 PlainTextDocument list
inspect(docs)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 2088532
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 161573
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 1627378
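As a quick numeric complement to the corpus inspection above, here is a minimal sketch of basic line and word counts for the 1% samples. It assumes twitter2, news2 and blogs2 are still in the environment and uses the stringi package, an extra dependency not loaded above:
#Basic line and word counts for the 1% samples (sketch; stringi is an extra package)
library(stringi)
data.frame(
  file  = c("Blogs", "News", "Twitter"),
  lines = c(length(blogs2), length(news2), length(twitter2)),
  words = c(sum(stri_count_words(blogs2)),
            sum(stri_count_words(news2)),
            sum(stri_count_words(twitter2)))
)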
## BLOGS DOCUMENT
#The command above shows there are 3 files in the corpus.
#Now let's build a Document Term Matrix:
#a matrix that shows the word frequencies.
#Start with the blogs document
dtmblog <- DocumentTermMatrix(docs[1])
freqblog <- colSums(as.matrix(dtmblog))
length(freqblog)
## [1] 48944
ordblog <- order(freqblog)
freqblog[head(ordblog)]
## '07), '56. '60 '60s! '90s 'america's
## 1 1 1 1 1 1
freqblog[tail(ordblog)]
## was with for that and the
## 2768 2841 3639 4446 10658 18362
#Now make a dataframe of the document matrix which will be used to plot the frequencies
dfblog <- data.frame(word=names(freqblog), freq=freqblog)
## NEWS DOCUMENT
#For the news and Twitter documents the same steps are used, only with different object names.
#Not every intermediate step is shown, to keep the report concise.
dtmnews <- DocumentTermMatrix(docs[2])
freqnews <- colSums(as.matrix(dtmnews))
length(freqnews)
## [1] 9035
ordnews <- order(freqnews)
freqnews[head(ordnews)]
## '60s 'cue 'it 'medical 'oh, 'thank
## 1 1 1 1 1 1
freqnews[tail(ordnews)]
## was with that for and the
## 192 235 246 247 674 1583
#Now make a dataframe of the document matrix for the plot of the frequencies
dfnews <- data.frame(word=names(freqnews), freq=freqnews)
## TWITTER DOCUMENT
dtmtwit <- DocumentTermMatrix(docs[3])
freqtwit <- colSums(as.matrix(dtmtwit))
length(freqtwit)
## [1] 44831
ordtwit <- order(freqtwit)
freqtwit[head(ordtwit)]
## \035amazop[;;; ''would '00, '08 '08?
## 1 1 1 1 1
## '09
## 1
freqtwit[tail(ordtwit)]
## your that for and you the
## 1719 2162 3826 4398 4705 9135
#Now make a dataframe of the document matrix for the frequency plot
dftwit <- data.frame(word=names(freqtwit), freq=freqtwit)
Basic histograms are shown below to visualize the word frequencies.
#Plot the word frequencies of the words that appear more than 2500 times in a histogram.
#Cutoffs of 2500, 100 and 1500 are chosen because many stopwords are included.
#Removing these stopwords would give a cleaner visualization,
#but the prediction model works better when the stopwords are kept
#(see the stopword-removal sketch after the three plots).
#PLOT BLOGS FREQUENCIES
p <- ggplot(subset(dfblog, freq>2500), aes(word, freq))
p <- p + geom_bar(stat="identity")+ labs(title="Blog document frequencies of most common words")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
#PLOT NEWS FREQUENCIES
q <- ggplot(subset(dfnews, freq>100), aes(word, freq))
q <- q + geom_bar(stat="identity")+ labs(title="News document frequencies of most common words")
q <- q + theme(axis.text.x=element_text(angle=45, hjust=1))
q
#PLOT TWITTER FREQUENCIES
r <- ggplot(subset(dftwit, freq>1500), aes(word, freq))
r <- r + geom_bar(stat="identity")+ labs(title="Twitter document frequencies of most common words")
r <- r + theme(axis.text.x=element_text(angle=45, hjust=1))
r
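As noted in the comments above, common stopwords dominate these histograms. Below is a minimal sketch of how they could be stripped from the corpus before building a Document Term Matrix; it is deliberately not applied in this report, because the prediction model should keep the stopwords:
#Sketch only: remove English stopwords before building a Document Term Matrix.
#Not applied above, since the prediction model needs the stopwords.
docs_nostop <- tm_map(docs, content_transformer(tolower))
docs_nostop <- tm_map(docs_nostop, removeWords, stopwords("english"))
dtmblog_nostop <- DocumentTermMatrix(docs_nostop[1])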
Another nice way to show the word frequencies is with word clouds, in which the most frequently used words appear larger.
#Make a word cloud for the blogs document
#Plot the 100 most frequently used words in color
set.seed(1000)
dark2 <- brewer.pal(6, "Dark2")
wordcloud(names(freqblog), freqblog, max.words=100, rot.per=0.2, colors=dark2)
#Make a word cloud for the news document
#Plot the 100 most frequently used words in color
set.seed(1000)
dark2 <- brewer.pal(6, "Dark2")
wordcloud(names(freqnews), freqnews, max.words=100, rot.per=0.2, colors=dark2)
#Make a word cloud for the Twitter document
#Plot the 100 most frequently used words in color
set.seed(1000)
dark2 <- brewer.pal(6, "Dark2")
wordcloud(names(freqtwit), freqtwit, max.words=100, rot.per=0.2, colors=dark2)
There are plans for creating a prediction algorithm and a Shiny app. The plan is to use an n-gram model for 2-word and 3-word sequences. Before the n-gram model can be used, the data has to be preprocessed to remove numbers, capitalization and punctuation.
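A rough sketch of that preprocessing step using tm is shown below; the object name docs_clean is a placeholder and the exact pipeline may still change:
#Planned preprocessing (sketch): lower-case, remove numbers and punctuation, trim whitespace
docs_clean <- tm_map(docs, content_transformer(tolower))
docs_clean <- tm_map(docs_clean, removeNumbers)
docs_clean <- tm_map(docs_clean, removePunctuation)
docs_clean <- tm_map(docs_clean, stripWhitespace)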
For the further analysis the ngram(x, n) function will be used; this n-gram model is still in progress.
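As a minimal sketch of that work in progress, the ngram package's ngram(x, n) function could be applied to the cleaned blog sample like this (docs_clean comes from the preprocessing sketch above; object names are placeholders):
#Sketch: build bigram and trigram frequency tables with the ngram package
library(ngram)
blog_text <- paste(content(docs_clean[[1]]), collapse = " ")
bigrams  <- get.phrasetable(ngram(blog_text, n = 2))
trigrams <- get.phrasetable(ngram(blog_text, n = 3))
head(bigrams)  #most frequent two-word sequences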
The Shiny app will consist of a text input box where the user can enter one word. The prediction algorithm will then predict the next word that is most likely to appear.
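A bare-bones sketch of that app structure is shown below; predict_next_word() is a placeholder for the n-gram lookup that still has to be written:
#Skeleton of the planned Shiny app (sketch only)
library(shiny)

predict_next_word <- function(word) {
  #placeholder: to be replaced by the real bigram/trigram lookup
  "..."
}

ui <- fluidPage(
  titlePanel("Next word prediction"),
  textInput("word", "Type one word:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    predict_next_word(input$word)
  })
}

shinyApp(ui = ui, server = server)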