In this capstone project, we are working on understanding and building predictive text models like those used by SwiftKey: when someone types a word, the keyboard suggests several options for what the next word might be. In this milestone report (week 2 of the capstone project), I demonstrate that I successfully loaded the data into my R workspace and present, step by step, the techniques used to clean the data and build the corpus from the three documents provided (blogs, news and twitter). Tokenization has been the key process in the work so far: it is the process of breaking a stream of text into words, phrases or other meaningful elements called tokens, and the list of tokens becomes the input for exploratory analysis and any further post-processing. I will display plots and tables of the main results and interesting facts so that we can better understand the corpus.
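As a minimal illustration (a base-R sketch on a made-up sentence, not the capstone data), tokenization can be as simple as splitting text on whitespace:
# Toy illustration of tokenization: split a sentence into word tokens
sentence <- "When someone types a word the keyboard suggests the next one"
unlist(strsplit(tolower(sentence), "\\s+"))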
The following libraries are loaded as they are needed throughout the analysis:
library(tm)
library(SnowballC)
library(RWeka)
library(ngram)
library(ggplot2)
library(cowplot)
library(wordcloud)
Our training data, which can be downloaded from the link below, is an English data set composed of three text documents (blogs, news and twitter): link
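For reproducibility, the archive can also be fetched and unzipped directly from R; the URL and zip file name below are placeholders, to be replaced by the actual course download link:
# Hypothetical download step (data_url and zip_file are placeholders)
data_url <- "<capstone dataset URL>"
zip_file <- "capstone_dataset.zip"
if (!file.exists(zip_file)) download.file(data_url, zip_file, mode = "wb")
if (!dir.exists("final")) unzip(zip_file)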
setwd("D:/MOOC/Data science Specialization/C10 - Data science capstone project/Original Dataset/final")
#Reading the file "en_US.blogs.txt"
file1 <- "en_US/en_US.blogs.txt"; con <- file(file1,open = "rb")
usblog <- readLines(con, skipNul = TRUE); close(con)
#Reading the file "en_US.news.txt"
file2 <- "en_US/en_US.news.txt"; con <- file(file2,open = "rb")
usnews <- readLines(con,skipNul = TRUE); close(con)
#Reading the file "en_US.twitter.txt"
file3 <- "en_US/en_US.twitter.txt"; con <- file(file3,open = "rb")
ustwitter <- readLines(con,skipNul = TRUE); close(con)
In this section, we will calculate for each data set:

- the number of lines
- the total number of characters (note that nchar() counts characters, not words)
- the length of the longest line, in characters

We then display a table with the main statistics of the documents.
nblines <- sapply(list(usblog,usnews,ustwitter),length)
nbchar <- sapply(list(usblog,usnews,ustwitter),nchar)
stat_sum <- cbind(c("blog","news","twitter"),nblines,sapply(nbchar,sum),sapply(nbchar,max))
stat_table <- as.data.frame.array(stat_sum)
colnames(stat_table) <- c("file","Nb_lines","Nb_chars","Max_CpL")
knitr::kable(stat_table)
| file | Nb_lines | Nb_chars | Max_CpL |
|---|---|---|---|
| blog | 899288 | 208361438 | 40835 |
| news | 1010242 | 203791405 | 11384 |
| twitter | 2360148 | 162385035 | 213 |
Since tweets are limited in number of characters, twitter is, as expected, the data set with the shortest lines.
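If actual word counts were needed, a rough estimate could be obtained by splitting each line on whitespace; this is only an illustrative sketch (and slow on the full files):
# Approximate word counts per file (splitting lines on whitespace)
count_words <- function(lines) sum(vapply(strsplit(lines, "\\s+"), length, integer(1)))
sapply(list(usblog, usnews, ustwitter), count_words)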
Due to the limited computational capacity of my machine and the huge number of words in each document, we will take a random sample of our data (keeping only 5% of the original data). Then we will tackle the cleaning task using the transformations of the “tm” package, and finally we will combine the three data sets to build our corpus. But first, we remove all non-ASCII characters (a rough way of stripping non-English text):
usblog_en <- sapply(usblog,function(word) iconv(word, "latin1", "ASCII", sub=""))
usnews_en <- sapply(usnews,function(word) iconv(word, "latin1", "ASCII", sub=""))
ustwitter_en <- sapply(ustwitter,function(word) iconv(word, "latin1", "ASCII", sub=""))
#Re-calculating the total number of characters
nbchar_en <- sapply(list(usblog_en,usnews_en,ustwitter_en),nchar)
sapply(nbchar_en,sum)
## [1] 206043906 202917604 161961555
The character count has decreased, but only slightly: about 1% for blogs and less than 0.5% for news and twitter.
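The reduction can be quantified directly from the character totals computed above:
# Share of characters removed by the ASCII conversion, per file
before <- sapply(nbchar, sum)
after  <- sapply(nbchar_en, sum)
round(100 * (before - after) / before, 2)
## [1] 1.11 0.43 0.26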
set.seed(111); blog <- sample(usblog_en,length(usblog_en)*0.05)
set.seed(222); news <- sample(usnews_en,length(usnews_en)*0.05)
set.seed(333); twitter <- sample(ustwitter_en,length(ustwitter_en)*0.05)
corpus <- VCorpus(VectorSource(c(blog,news,twitter)),readerControl=list(reader=readPlain,language="en"))
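A quick sanity check on the sampled corpus (assuming the chunks above ran):
length(corpus)                         # number of documents kept after the 5% sampling
writeLines(as.character(corpus[[1]]))  # content of the first sampled document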
Cleaning the data is one of the most important steps in the analysis. The tm package offers a number of transformations that ease the data cleaning process.
We often use underscores (especially on twitter) and hyphens without spaces between the words they separate. Applying the removePunctuation transformation without fixing this first would merge the two words on either side of the symbol, so we handle it before any other transformation (a toy example after the code below illustrates the problem).
toSpace <- content_transformer(function(x, pattern) gsub(pattern," ", x))
corpus <- tm_map(corpus,toSpace,"-")
corpus <- tm_map(corpus,toSpace,"_")
As mentioned before, the “tm” package offers a number of transformations; we can list them with the getTransformations() command:
getTransformations()
## [1] "removeNumbers" "removePunctuation" "removeWords"
## [4] "stemDocument" "stripWhitespace"
In addition to the above operations, we’ll use the “tolower” and “PlainTextDocument” transformations. Most of these are self-explanatory; here are some clarifications about the less intuitive ones (toy examples follow the code below):

- stemDocument: stemming is the process of reducing words to their common root (“playing” becomes “play”, for example).
- removeWords with stopwords("english"): stop words include articles, conjunctions, common verbs and other very frequent words that we don’t want to predict.
corpus <- tm_map(corpus,content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus,stemDocument)
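To make the last two transformations concrete, here is what they do to small made-up inputs:
removeWords("she was playing the games", stopwords("english"))  # drops "she", "was" and "the"
stemDocument(c("playing", "games", "played"))                   # reduces each word to its root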
Now that our corpus is clean, we’ll start looking for interesting facts in the data at hand. Tokenization is our first challenge, as the tokens will be the input of all post-processing operations.
token1 <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
token2 <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
token3 <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm1 <- DocumentTermMatrix(corpus, control = list(tokenize = token1))
dtm2 <- DocumentTermMatrix(corpus, control = list(tokenize = token2))
dtm3 <- DocumentTermMatrix(corpus, control = list(tokenize = token3))
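A quick check of what the tokenizers and the resulting matrices look like (assuming the chunks above ran):
# What the bigram tokenizer produces on a toy string
token2("thanks for the follow")
# Dimensions (documents x distinct terms) of each matrix
dim(dtm1); dim(dtm2); dim(dtm3)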
Due to processing power limits, we cannot sort the whole vocabulary, so we keep only the terms with a frequency of at least 200 for unigrams and at least 50 for bigrams and trigrams:
unigram <- findFreqTerms(dtm1,lowfreq = 200)
bigram <- findFreqTerms(dtm2,lowfreq=50)
trigram <- findFreqTerms(dtm3,lowfreq=50)
freq1 <- colSums(as.matrix(dtm1[,unigram]))
freq2 <- colSums(as.matrix(dtm2[,bigram]))
freq3 <- colSums(as.matrix(dtm3[,trigram]))
#Filtering the top10 N-Grams
freq11 <- data.frame(word=names(freq1),frequency=freq1,row.names = NULL)
df1 <- freq11[order(-freq11$frequency),][1:10,]
freq22 <- data.frame(word=names(freq2),frequency=freq2,row.names = NULL)
df2 <- freq22[order(-freq22$frequency),][1:10,]
freq33 <- data.frame(word=names(freq3),frequency=freq3,row.names = NULL)
df3 <- freq33[order(-freq33$frequency),][1:10,]
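Before plotting, the top entries can be checked directly:
# Most frequent unigrams, bigrams and trigrams after filtering
head(df1, 3); head(df2, 3); head(df3, 3)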
plot1 <- ggplot(data=df1, aes(x=word, y=frequency,fill=frequency))+
geom_bar(stat="identity")+guides(fill=FALSE)+
theme(axis.text.x=element_text(angle=90))+
scale_x_discrete(limits=df1$word)+
labs(title="Top10 Unigrams")+xlab("words")
plot2 <- ggplot(data=df2, aes(x=word, y=frequency))+
geom_bar(stat="identity",fill="darkgreen")+guides(fill=FALSE)+
theme(axis.text.x=element_text(angle=90))+
scale_x_discrete(limits=df2$word)+
labs(title="Top10 Bigrams")+xlab("")+ylab("")
plot3 <- ggplot(data=df3, aes(x=word, y=frequency))+
geom_bar(stat="identity",fill="orange")+guides(fill=FALSE)+
theme(axis.text.x=element_text(angle=90))+
scale_x_discrete(limits=df3$word)+
labs(title="Top10 Trigrams")+xlab("")+ylab("")
#Combining the 3 plots in one row
plot_grid(plot1,plot2,plot3,nrow = 1,ncol = 3)
The word cloud below displays the 100 most frequently used words (after cleaning and stemming):
set.seed(5)
wordcloud(names(freq1), freq1, max.words=100, scale=c(3, 0.1), colors=brewer.pal(8, "Dark2"))
In this project, we were able to load, understand and tokenize our data. This gives us a head start on the next development stages: fitting a predictive model using our tokens as input and building a Shiny application similar to the SwiftKey keyboard.
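As a rough preview of that next stage (a hypothetical sketch, not the final model), a naive next-word lookup can already be built on the bigram frequencies computed above; a real model would need smoothing and back-off across n-gram orders:
# Hypothetical sketch: predict_next is a made-up helper built on the bigram table freq22
predict_next <- function(word, bigrams = freq22, n = 3) {
  hits <- bigrams[grepl(paste0("^", word, " "), bigrams$word), ]  # bigrams starting with 'word'
  hits <- hits[order(-hits$frequency), ]                          # most frequent first
  head(sub(paste0("^", word, " "), "", hits$word), n)             # keep only the second token
}
predict_next("look")   # candidate continuations of "look" in the sampled corpus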