In this capstone project, we are working on understanding and building predictive text models like those used by SwiftKey: when someone types a word, the keyboard suggests several options for what the next word might be. In this milestone report (week 2 of the capstone project), I demonstrate that I successfully loaded the data into my R workspace and present, step by step, the techniques used to clean the data and build the corpus from the three documents provided (blogs, news and twitter). Tokenization has been the key process in the work so far: it is the process of breaking a stream of text into words, phrases or other meaningful elements called tokens, and the list of tokens becomes the input for exploratory analysis and any further post-processing. I will display plots and tables of the main results and interesting facts so that we can better understand the corpus.
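As a minimal illustration (a base-R sketch on a made-up sentence, not the capstone data), tokenization can be as simple as splitting text on whitespace:
# Toy illustration of tokenization: split a sentence into word tokens
sentence <- "When someone types a word the keyboard suggests the next one"
unlist(strsplit(tolower(sentence), "\\s+"))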
The following libraries are loaded as they are needed throughout the analysis:
library(tm)
library(SnowballC)
library(RWeka)
library(ngram)
library(ggplot2)
library(cowplot)
library(wordcloud)
Our training data, which can be downloaded from the link below, is an English data set composed of three text documents (blogs, news and twitter): link
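For reproducibility, the archive can also be fetched and unzipped directly from R; the URL and zip file name below are placeholders, to be replaced by the actual course download link:
# Hypothetical download step (data_url and zip_file are placeholders)
data_url <- "<capstone dataset URL>"
zip_file <- "capstone_dataset.zip"
if (!file.exists(zip_file)) download.file(data_url, zip_file, mode = "wb")
if (!dir.exists("final")) unzip(zip_file)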
setwd("D:/MOOC/Data science Specialization/C10 - Data science capstone project/Original Dataset/final")
#Reading the file "en_US.blogs.txt"
file1 <- "en_US/en_US.blogs.txt"; con <- file(file1,open = "rb")
usblog <- readLines(con, skipNul = TRUE); close(con)
#Reading the file "en_US.news.txt"
file2 <- "en_US/en_US.news.txt"; con <- file(file2,open = "rb")
usnews <- readLines(con,skipNul = TRUE); close(con)
#Reading the file "en_US.twitter.txt"
file3 <- "en_US/en_US.twitter.txt"; con <- file(file3,open = "rb")
ustwitter <- readLines(con,skipNul = TRUE); close(con)
In this section, we will calculate for each data set:

- the number of lines
- the total number of characters (note that nchar() counts characters, not words)
- the length of the longest line, in characters

We then display a table with the main statistics of the documents.
nblines <- sapply(list(usblog,usnews,ustwitter),length)
nbchar <- sapply(list(usblog,usnews,ustwitter),nchar)
stat_sum <- cbind(c("blog","news","twitter"),nblines,sapply(nbchar,sum),sapply(nbchar,max))
stat_table <- as.data.frame.array(stat_sum)
colnames(stat_table) <- c("file","Nb_lines","Nb_chars","Max_CpL")
knitr::kable(stat_table)
| file | Nb_lines | Nb_chars | Max_CpL |
|---|---|---|---|
| blog | 899288 | 208361438 | 40835 |
| news | 1010242 | 203791405 | 11384 |
| twitter | 2360148 | 162385035 | 213 |
Since tweets are limited in number of characters, twitter is, as expected, the data set with the shortest lines.
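If actual word counts were needed, a rough estimate could be obtained by splitting each line on whitespace; this is only an illustrative sketch (and slow on the full files):
# Approximate word counts per file (splitting lines on whitespace)
count_words <- function(lines) sum(vapply(strsplit(lines, "\\s+"), length, integer(1)))
sapply(list(usblog, usnews, ustwitter), count_words)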
Due to the limited computational capacity of my machine and the huge number of words in each document, we will take a random sample of our data (keeping only 5% of the original data). Then we will tackle the cleaning task using the transformations of the “tm” package, and finally we will combine the three data sets to build our corpus. But first, we remove all non-ASCII characters (a rough way of stripping non-English text):
usblog_en <- sapply(usblog,function(word) iconv(word, "latin1", "ASCII", sub=""))
usnews_en <- sapply(usnews,function(word) iconv(word, "latin1", "ASCII", sub=""))
ustwitter_en <- sapply(ustwitter,function(word) iconv(word, "latin1", "ASCII", sub=""))
#Re-calculating the total number of characters
nbchar_en <- sapply(list(usblog_en,usnews_en,ustwitter_en),nchar)
sapply(nbchar_en,sum)
## [1] 206043906 202917604 161961555
The character count has decreased, but only slightly: about 1% for blogs and less than 0.5% for news and twitter.
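The reduction can be quantified directly from the character totals computed above:
# Share of characters removed by the ASCII conversion, per file
before <- sapply(nbchar, sum)
after  <- sapply(nbchar_en, sum)
round(100 * (before - after) / before, 2)
## [1] 1.11 0.43 0.26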
set.seed(111); blog <- sample(usblog_en,length(usblog_en)*0.05)
set.seed(222); news <- sample(usnews_en,length(usnews_en)*0.05)
set.seed(333); twitter <- sample(ustwitter_en,length(ustwitter_en)*0.05)
corpus <- VCorpus(VectorSource(c(blog,news,twitter)),readerControl=list(reader=readPlain,language="en"))
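A quick sanity check on the sampled corpus (assuming the chunks above ran):
length(corpus)                         # number of documents kept after the 5% sampling
writeLines(as.character(corpus[[1]]))  # content of the first sampled document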
Cleaning the data is one of the most important steps in the analysis. The tm package offers a number of transformations that ease the data cleaning process.
We often use underscores (especially on twitter) and hyphens without spaces between the words they separate. Applying the removePunctuation transformation without fixing this first would merge the two words on either side of the symbol, so we handle it before any other transformation (a toy example after the code below illustrates the problem).
toSpace <- content_transformer(function(x, pattern) gsub(pattern," ", x))
corpus <- tm_map(corpus,toSpace,"-")
corpus <- tm_map(corpus,toSpace,"_")
As mentioned before, the “tm” package offers a number of transformations; we can list them with the getTransformations() command:
getTransformations()
## [1] "removeNumbers" "removePunctuation" "removeWords"
## [4] "stemDocument" "stripWhitespace"
In addition to the above operations, we’ll use the “tolower” and “PlainTextDocument” transformations. Most of these are self-explanatory; here are some clarifications about the less intuitive ones (toy examples follow the code below):

- stemDocument: stemming is the process of reducing words to their common root (“playing” becomes “play”, for example).
- removeWords with stopwords("english"): stop words include articles, conjunctions, common verbs and other very frequent words that we don’t want to predict.
corpus <- tm_map(corpus,content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus,stemDocument)
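To make the last two transformations concrete, here is what they do to small made-up inputs:
removeWords("she was playing the games", stopwords("english"))  # drops "she", "was" and "the"
stemDocument(c("playing", "games", "played"))                   # reduces each word to its root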
Now that our corpus is clean, we’ll start looking for interesting facts in the data at hand. Tokenization is our first challenge, as the tokens will be the input of all post-processing operations.
token1 <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
token2 <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
token3 <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm1 <- DocumentTermMatrix(corpus, control = list(tokenize = token1))
dtm2 <- DocumentTermMatrix(corpus, control = list(tokenize = token2))
dtm3 <- DocumentTermMatrix(corpus, control = list(tokenize = token3))
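A quick check of what the tokenizers and the resulting matrices look like (assuming the chunks above ran):
# What the bigram tokenizer produces on a toy string
token2("thanks for the follow")
# Dimensions (documents x distinct terms) of each matrix
dim(dtm1); dim(dtm2); dim(dtm3)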
Due to processing power limits, we cannot sort the whole vocabulary, so we keep only the terms with a frequency of at least 200 for unigrams and at least 50 for bigrams and trigrams:
unigram <- findFreqTerms(dtm1,lowfreq = 200)
bigram <- findFreqTerms(dtm2,lowfreq=50)
trigram <- findFreqTerms(dtm3,lowfreq=50)
freq1 <- colSums(as.matrix(dtm1[,unigram]))
freq2 <- colSums(as.matrix(dtm2[,bigram]))
freq3 <- colSums(as.matrix(dtm3[,trigram]))
#Filtering the top10 N-Grams
freq11 <- data.frame(word=names(freq1),frequency=freq1,row.names = NULL)
df1 <- freq11[order(-freq11$frequency),][1:10,]
freq22 <- data.frame(word=names(freq2),frequency=freq2,row.names = NULL)
df2 <- freq22[order(-freq22$frequency),][1:10,]
freq33 <- data.frame(word=names(freq3),frequency=freq3,row.names = NULL)
df3 <- freq33[order(-freq33$frequency),][1:10,]
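Before plotting, the top entries can be checked directly:
# Most frequent unigrams, bigrams and trigrams after filtering
head(df1, 3); head(df2, 3); head(df3, 3)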
plot1 <- ggplot(data=df1, aes(x=word, y=frequency,fill=frequency))+
geom_bar(stat="identity")+guides(fill=FALSE)+
theme(axis.text.x=element_text(angle=90))+
scale_x_discrete(limits=df1$word)+
labs(title="Top10 Unigrams")+xlab("words")
plot2 <- ggplot(data=df2, aes(x=word, y=frequency))+
geom_bar(stat="identity",fill="darkgreen")+guides(fill=FALSE)+
theme(axis.text.x=element_text(angle=90))+
scale_x_discrete(limits=df2$word)+
labs(title="Top10 Bigrams")+xlab("")+ylab("")
plot3 <- ggplot(data=df3, aes(x=word, y=frequency))+
geom_bar(stat="identity",fill="orange")+guides(fill=FALSE)+
theme(axis.text.x=element_text(angle=90))+
scale_x_discrete(limits=df3$word)+
labs(title="Top10 Trigrams")+xlab("")+ylab("")
#Combining the 3 plots in one row
plot_grid(plot1,plot2,plot3,nrow = 1,ncol = 3)
The word cloud below displays the 100 most frequently used words (after cleaning and stemming):
set.seed(5)
wordcloud(names(freq1), freq1, max.words=100, scale=c(3, 0.1), colors=brewer.pal(8, "Dark2"))
In this project, we were able to load, understand and tokenize our data. This gives us a head start on the next development stages: fitting a predictive model using our tokens as input and building a Shiny application similar to the SwiftKey keyboard.
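As a rough preview of that next stage (a hypothetical sketch, not the final model), a naive next-word lookup can already be built on the bigram frequencies computed above; a real model would need smoothing and back-off across n-gram orders:
# Hypothetical sketch: predict_next is a made-up helper built on the bigram table freq22
predict_next <- function(word, bigrams = freq22, n = 3) {
  hits <- bigrams[grepl(paste0("^", word, " "), bigrams$word), ]  # bigrams starting with 'word'
  hits <- hits[order(-hits$frequency), ]                          # most frequent first
  head(sub(paste0("^", word, " "), "", hits$word), n)             # keep only the second token
}
predict_next("look")   # candidate continuations of "look" in the sampled corpus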