This is an update report for the Capstone class in the Coursera Data Science specialization. The data for this project can be found at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The data is composed of several text files, of which we are using the English-language files for blogs, Twitter posts and news feeds.
There are several ways to read the files into R, depending on what you want to do with them. For example, the read.table command can be used to bring the files in as tables of words. The better approach is to read them in as lines of text and work with them through the structures of a text mining package, which allows the use of higher-level functions.
setwd("/Users/jwhitney/Documents/Coursera/Capstone")
library(tm)
set.seed(567)
# Read in three big files
twitter <- readLines("en_US.twitter.txt")
news <- readLines("en_US.news.txt")
blogs <- readLines("en_US.blogs.txt")
The size of these files can be found as follows:
twitterLines <- length(twitter)                    # number of lines
twitterWords <- unlist(strsplit(twitter, split = " "))
twitterUnique <- length(unique(twitterWords))      # number of unique words
newsLines <- length(news)                          # number of lines
newsWords <- unlist(strsplit(news, split = " "))
newsUnique <- length(unique(newsWords))            # number of unique words
blogsLines <- length(blogs)                        # number of lines
blogsWords <- unlist(strsplit(blogs, split = " "))
blogsUnique <- length(unique(blogsWords))          # number of unique words
The sizes of the files are as follows: the Twitter file has 2360148 lines and 1291070 unique words; the News file has 1010242 lines and 876772 unique words; the Blogs file has 899288 lines and 1103548 unique words.
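These counts can also be gathered into a single data frame for a compact summary. A minimal sketch, assuming the line, word and unique-word objects computed above are still in the workspace:
fileSummary <- data.frame(
  file = c("Twitter", "News", "Blogs"),
  lines = c(twitterLines, newsLines, blogsLines),
  totalWords = c(length(twitterWords), length(newsWords), length(blogsWords)),
  uniqueWords = c(twitterUnique, newsUnique, blogsUnique))
fileSummary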
The exploratory data analysis was done on a corpus created from 1% of the lines from each of the three files. One percent was chosen because processing even 10% of the data was very slow. The following code shows how the small sample files were created.
setwd("/Users/jwhitney/Documents/Coursera/Capstone")
library(tm)
set.seed(567)
# choose the fraction of lines you want in each small file
percent <- 0.01
# NEWS
lines <- length(news)
SmallSet <- sample(news, size = round(lines * percent), replace = FALSE)
writeLines(SmallSet, "SmallNews.txt", sep = "\n", useBytes = FALSE)
# BLOGS
lines <- length(blogs)
SmallSet <- sample(blogs, size = round(lines * percent), replace = FALSE)
writeLines(SmallSet, "SmallBlogs.txt", sep = "\n", useBytes = FALSE)
# TWITTER
lines <- length(twitter)
SmallSet <- sample(twitter, size = round(lines * percent), replace = FALSE)
writeLines(SmallSet, "SmallTwitter.txt", sep = "\n", useBytes = FALSE)
It turned out to be best to remove non-ASCII characters (such as foreign-language text) prior to creating a corpus. The following code does that.
# move the three small files into a folder called smallFiles
setwd("~/CourseraR/Capstone/unzipped/smallFiles")
smallBlog <- readLines("SmallBlogs.txt")
smallTwitter <- readLines("SmallTwitter.txt")
smallNews <- readLines("SmallNews.txt")
# Remove non-ASCII text from the documents (iconv is vectorized, so no loop is needed)
smallBlog <- iconv(smallBlog, "latin1", "ASCII", sub = "")
smallTwitter <- iconv(smallTwitter, "latin1", "ASCII", sub = "")
smallNews <- iconv(smallNews, "latin1", "ASCII", sub = "")
# Write the cleaned text back over the sampled files
writeLines(smallBlog, "SmallBlogs.txt")
writeLines(smallTwitter, "SmallTwitter.txt")
writeLines(smallNews, "SmallNews.txt")
At this point the three files have been replaced with English-only versions. The next step is to create a corpus and clean the data. A corpus is a structure that holds a collection of text documents. The cleaning commands remove numbers and punctuation, collapse extra white space (leaving one space between words) and convert everything to lower case. The inspect command shows a small segment of the Document Term Matrix, a matrix with one row per document and one column per term, whose entries give how often each term appears in each document.
#### now make a corpus
library(tm)
## Loading required package: NLP
setwd("/Users/jwhitney/Documents/Coursera/Capstone")
src <- DirSource("smallFiles/")
smallCorpus <- Corpus(src, readerControl = list(reader = readPlain))
summary(smallCorpus)
## Length Class Mode
## SmallBlogs.txt 2 PlainTextDocument list
## SmallNews.txt 2 PlainTextDocument list
## SmallTwitter.txt 2 PlainTextDocument list
library(RWeka)
# preprocess smallCorpus
smallCorpus<-tm_map(smallCorpus, removeNumbers)
smallCorpus<-tm_map(smallCorpus, removePunctuation)
smallCorpus<-tm_map(smallCorpus, stripWhitespace)
smallCorpus<-tm_map(smallCorpus, content_transformer(tolower))
adtm<-DocumentTermMatrix(smallCorpus)
adtm<-removeSparseTerms(adtm,0.75)
inspect(adtm[1:3,1:3])
## <<DocumentTermMatrix (documents: 3, terms: 3)>>
## Non-/sparse entries: 4/5
## Sparsity : 56%
## Maximal term length: 10
## Weighting : term frequency (tf)
##
## Terms
## Docs aaa aaaaahhhhh aaaackk
## SmallBlogs.txt 1 0 0
## SmallNews.txt 6 0 0
## SmallTwitter.txt 0 1 1
The prompt suggested that a histogram of terms might also be nice. The words along the bottom axis are examples of the kinds of words found at different frequencies.
freq<-colSums(as.matrix(adtm))
ord<-order(-freq)
freqThousand<-head(freq[ord],1000)
options(scipen=3)
barplot(freqThousand,main="Histogram of 1000 most frequent words")
From the histogram it is easy to see that “the” is the most common word. This also shows up in a word cloud.
library(wordcloud)
## Loading required package: RColorBrewer
freq<-colSums(as.matrix(adtm))
ord<-order(-freq)
mostfreq<-as.data.frame(freq[head(ord,100)])
wordcloud(rownames(mostfreq),mostfreq[,1],scale=c(3,1),max.words=100,random.order=F,colors=brewer.pal(8,"Dark2"))
The goal is to create a predictor that gives the user the next most likely word given a few words as input. This is like the typing predictor used for texting. This will take several stages.
First, the large text files will need to be pre-processed like the smaller samples: the non-English text will be removed (a one-time step), along with obscene words. For the obscenity filtering I plan to check words against a published list, which can be found at https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words.
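As a rough sketch of how that filtering could work (the raw-file name en for the English list, the local file name badwords.txt and the use of the small corpus here are assumptions for illustration; the same call would be applied to the full corpus once it is built):
# Download the published profanity list once; the file name "en" for the
# English list is assumed from the repository layout and may need adjusting
download.file(
  "https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en",
  destfile = "badwords.txt")
badWords <- readLines("badwords.txt")
# Drop the listed words from the corpus with tm's removeWords transformation
smallCorpus <- tm_map(smallCorpus, removeWords, badWords)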
After that cleaning, the text files will be loaded and built into a corpus and given the same pre-processing described above for the small sample. Stop words (words like “the” shown in the histogram) will not be removed, because they are likely to be useful for prediction.
To prepare for machine learning I will need to create a data frame of n-grams that can be used by the machine learning packages. It looks like a tokenized term document matrix can be converted into such a data frame. I expect to try multiple models against the data to find one that gives decent accuracy and can predict relatively quickly. Support vector machines have been recommended in my research. Using word stems should also both speed up prediction and improve accuracy.
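One possible way to build that table is to tokenize the corpus into n-grams with RWeka and collapse the resulting term document matrix into a frequency data frame. A minimal sketch for bigrams on the small corpus (the object and column names are illustrative, not part of the final design; with newer versions of tm the corpus may need to be a VCorpus for a custom tokenizer to be honored):
library(RWeka)
# Tokenizer that produces two-word n-grams; change min/max for other orders
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigramTDM <- TermDocumentMatrix(smallCorpus, control = list(tokenize = BigramTokenizer))
# Collapse the term document matrix into a data frame of bigram frequencies
bigramFreq <- sort(rowSums(as.matrix(bigramTDM)), decreasing = TRUE)
bigramDF <- data.frame(ngram = names(bigramFreq), freq = bigramFreq,
                       row.names = NULL, stringsAsFactors = FALSE)
head(bigramDF)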
The Shiny app will need to take a few words (as defined by the n-gram predictor) and return the next word in the sequence. The input words will need to be processed in the same way as the training text before being passed to the predictor.
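A minimal sketch of that lookup, assuming a bigram frequency table like the bigramDF above (the function name predictNextWord is hypothetical, and a real app would back off to lower-order n-grams when no match is found):
# Clean the input the same way the corpus was cleaned, then return the most
# frequent word that follows the last input word in the bigram table
predictNextWord <- function(input, bigramDF) {
  input <- tolower(removeNumbers(removePunctuation(input)))
  words <- unlist(strsplit(trimws(input), "\\s+"))
  lastWord <- tail(words, 1)
  # keep bigrams whose first word matches the last word typed
  matches <- bigramDF[grepl(paste0("^", lastWord, " "), bigramDF$ngram), ]
  if (nrow(matches) == 0) return(NA)  # a real app would back off here
  strsplit(matches$ngram[which.max(matches$freq)], " ")[[1]][2]
}
# Example call (hypothetical): predictNextWord("thanks for the", bigramDF)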