Each day, people spend an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on these devices can be a serious pain. The idea of this project is to build a smart keyboard that predicts the next word that might be typed by the user.
In this report, we first describe the main features of the dataset that will be used to build the prediction model. We then analyze a large corpus of text documents to discover the structure in the data and how words are put together. Finally, we summarize our plans for the prediction algorithm and the application.
Note: Technical aspects of the analysis and data manipulation are left to the optional appendix for those who would like to reproduce the analysis.
The data provided for this project comes from three sources (blogs, news and twitter feeds) in four different languages (German, English, Finnish and Russian). For this project, the English dataset will be used. The dataset can be downloaded from this link. Table 1 shows general information about the English dataset.
| File | Size (MB) | Number of Lines | Number of Words |
|---|---|---|---|
| blogs | 200.4 | 899,288 | 37,570,839 |
| news | 196.3 | 1,010,242 | 34,494,539 |
| twitter | 159.4 | 2,360,148 | 30,451,128 |
Since the dataset is fairly large, and in order to reduce the time needed for preprocessing and cleaning, a text corpus was created by combining a 5% sample from each of the three sources.
The corpus was cleaned by:

- removing non-graphical characters, hashtags, Twitter handles, email addresses and URLs
- removing numbers
- converting all text to lowercase
- removing profanity
- stripping punctuation (keeping apostrophes) and extra whitespace

Stopwords were left in since they are part of normal language. The profanity list was downloaded from http://www.bannedwordlist.com. Stemming was not applied to reduce words to their root form, but it is something we may decide to apply at a later stage.
The first step in building a text prediction model is understanding the distribution of and relationship between words, tokens, and phrases in the text. To that end, exploratory data analysis was performed on the corpus. We tokenized the sample into unigrams, bigrams and trigrams using the RWeka package. Figures 1, 2 and 3 show the most frequent one-word, two-word and three-word combinations.
Figure 1. Most frequent single words
Figure 2. Most frequent two-word combinations
Figure 3. Most frequent three-word combinations
We also analyzed the number of words needed to cover the corpus. To do this, the unigram dictionary obtained from the sample data, sorted by frequency, was applied to the entire twitter dataset. Results are shown in Table 2, and a sketch of the computation is given after the table. Notably, fewer than 2,000 distinct words are enough to cover more than 80% of the text, which suggests that a prediction algorithm based on a fairly small dictionary could perform well.
| Number of Words | Coverage (%) |
|---|---|
| 5 | 10 |
| 26 | 20 |
| 62 | 30 |
| 136 | 40 |
| 269 | 50 |
| 555 | 60 |
| 1,021 | 70 |
| 1,886 | 80 |
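The coverage figures in Table 2 can be reproduced from a frequency-sorted unigram table. Below is a minimal sketch of the idea, assuming a vector of unigram counts such as the `frequency` column of the `One` data frame built in the appendix; the report applies the dictionary to the full twitter dataset, but the cumulative-coverage calculation is the same.

```r
# Minimal sketch: how many of the most frequent words are needed to reach
# each coverage level (assumes a vector of unigram counts, e.g. One$frequency).
coverage_table <- function(freqs, levels = seq(0.1, 0.8, by = 0.1)) {
  freqs <- sort(freqs, decreasing = TRUE)
  cum_cov <- cumsum(freqs) / sum(freqs)
  data.frame(Coverage = levels * 100,
             Words = sapply(levels, function(p) which(cum_cov >= p)[1]))
}
# Example: coverage_table(One$frequency)
```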
For the prediction algorithm we will use frequency-sorted dictionaries of unigrams, bigrams and trigrams, backing off to models with shorter histories when a longer n-gram is not found. The model with the most reliable information about a given history will be used to provide the best result. A sketch of this back-off strategy is shown below.
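The following is a minimal sketch of the planned back-off lookup, not a final implementation. It assumes the frequency-sorted data frames `One`, `Two` and `Three` built in the appendix; the function name `predict_next` and the handling of ties, smoothing and unseen words are illustrative only.

```r
# Minimal back-off sketch (assumes One, Two and Three from the appendix,
# each sorted by decreasing frequency).
predict_next <- function(text, n = 3) {
  words <- tolower(unlist(strsplit(text, "\\s+")))
  k <- length(words)
  # 1. Try the trigram table with the last two words
  if (k >= 2) {
    hits <- subset(Three, Word1 == words[k - 1] & Word2 == words[k])
    if (nrow(hits) > 0) return(head(hits$Word3, n))
  }
  # 2. Back off to the bigram table with the last word
  if (k >= 1) {
    hits <- subset(Two, Word1 == words[k])
    if (nrow(hits) > 0) return(head(hits$Word2, n))
  }
  # 3. Back off to the most frequent unigrams
  head(One$Word1, n)
}
# Example: predict_next("thanks for the")
```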
The Shiny app will have a text box for entering text. In another box, the app will show the top three predicted words. The user will be able to select one of the predicted words to use next. The app will also include a section with instructions. A minimal sketch of such an interface is shown below.
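This is only a rough sketch of the planned interface, assuming the hypothetical `predict_next()` function from the previous sketch; the final app will present the three choices as selectable options and include an instructions section.

```r
library(shiny)

# Rough interface sketch (assumes predict_next() is defined elsewhere).
ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("text", "Enter text:"),
  verbatimTextOutput("predictions")
)

server <- function(input, output) {
  output$predictions <- renderPrint({
    if (nchar(input$text) > 0) predict_next(input$text, n = 3)
  })
}

shinyApp(ui = ui, server = server)
```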
library("stringi")
library("pander")
library("tm")
library("qdap")
library("RWeka")
library("ggplot2")
library("knitr")
setwd("C:/Users/Adsi/Documents/Coursera/Data Science/10_Capstone project/")
# Blogs Data in binary mode
connection <- file("./final/en_US/en_US.blogs.txt", open = "rb")
blogs <- readLines(connection, encoding = "UTF-8")
close(connection)
# News data in binary mode
connection <- file("./final/en_US/en_US.news.txt", open = "rb")
news <- readLines(connection, encoding = "UTF-8")
close(connection)
# Twitter data in binary mode
connection <- file("./final/en_US/en_US.twitter.txt", open = "rb")
twitter <- readLines(connection, encoding = "UTF-8")
close(connection)
rm(connection)
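# Build Table 1: file size, number of lines and number of words for each source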
n <- 3
desc <- data.frame(File = rep(0, n),
                   Size = rep(0, n),
                   Lines = rep(0, n),
                   Words = rep(0, n))
Files <- c("./final/en_US/en_US.blogs.txt", "./final/en_US/en_US.news.txt",
           "./final/en_US/en_US.twitter.txt")
desc$File <- c("blogs","news","twitter")
tempo <- lapply(list(blogs,news,twitter), stri_stats_latex)
tempo <- data.frame(matrix(unlist(tempo), nrow=3, byrow=T))
desc$Words <- tempo$X4
tempo <- unlist(lapply(Files, file.size))
desc$Size <- tempo/(1024*1024)
desc$Lines <- unlist(lapply(list(blogs,news,twitter), length))
names(desc)<- c("File","Size (MB)","Number of Lines","Number of Words")
rm(Files,n,tempo)
set.seed(1234)
# Draw a 5% random sample of lines from each source
mysample.blogs <- blogs[sample(length(blogs), length(blogs) * 0.05)]
mysample.news <- news[sample(length(news), length(news) * 0.05)]
mysample.twitter <- twitter[sample(length(twitter), length(twitter) * 0.05)]
mysample <- c(mysample.blogs,mysample.news,mysample.twitter)
rm(mysample.blogs,mysample.news,mysample.twitter)
mycorpus <- Corpus(VectorSource(mysample))
rm(blogs, news, twitter)
rm(mysample)
gc()
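# Load the profanity word list used to filter the corpus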
connection <- file("final/swearWords.txt", open = "r")
profanityWords <-readLines(connection)
close(connection)
rm(connection)
# Cleaning the corpus object
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})
toEmpty <- content_transformer(function(x, pattern) {return (gsub(pattern, "", x))})
mycorpus<-tm_map(mycorpus, toSpace,"[^[:graph:]]")
mycorpus<-tm_map(mycorpus, toEmpty, "#\\w+") # hashtags
mycorpus<-tm_map(mycorpus, toEmpty , "(\\b\\S+\\@\\S+\\..{1,3}(\\s)?\\b)") # email address
mycorpus<-tm_map(mycorpus, toEmpty, "@\\w+")
mycorpus<-tm_map(mycorpus, toEmpty, "http[^[:space:]]*")
mycorpus<-tm_map(mycorpus, toSpace, "/|@|\\|")
mycorpus<-tm_map(mycorpus, removeNumbers)
mycorpus<-tm_map(mycorpus, content_transformer(tolower))
mycorpus<-tm_map(mycorpus, removeWords, profanityWords)
mycorpus<-tm_map(mycorpus, content_transformer(strip),char.keep="'")
mycorpus<-tm_map(mycorpus, stripWhitespace)
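# NOTE: the tokenizers and the nGramas() helper used below are not defined in
# this report. The following is a sketch of how they could be implemented with
# RWeka, assuming the original code built a TermDocumentMatrix with an
# NGramTokenizer and returned each n-gram together with its frequency.
UniGram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BiGram  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TriGram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
nGramas <- function(tokenizer, low = 1, high = Inf) {
  tdm <- TermDocumentMatrix(mycorpus,
                            control = list(tokenize = tokenizer,
                                           bounds = list(global = c(low, high))))
  freq <- slam::row_sums(tdm)
  data.frame(ngram = names(freq), frequency = freq,
             row.names = NULL, stringsAsFactors = FALSE)
}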
UniGramas <- nGramas(UniGram,low = 1, high=Inf)
UniGramas <- UniGramas[order(UniGramas$frequency, decreasing = TRUE), ]
head(UniGramas,20)
OneGram <- UniGramas
One<- OneGram
names(One) <- c("Word1","frequency")
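# Plot code for Figure 1 (most frequent single words) is not included in the
# report; this sketch mirrors the bigram and trigram plots further below.
g <- ggplot(OneGram[1:40,], aes(x = reorder(ngram, frequency), y = frequency, fill = frequency)) +
     geom_bar(stat = "identity") + coord_flip() +
     theme(legend.title = element_blank()) +
     theme(plot.title = element_text(size = 9)) +
     theme(axis.title = element_text(size = 8)) +
     theme(axis.text = element_text(size = 7)) +
     xlab("1-gram") + ylab("Frequency") +
     labs(title = "Top 40 Unigrams")
print(g)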
BiGramas <- nGramas(BiGram, low=1,high=Inf)
BiGramas <- BiGramas[order(BiGramas$frequency, decreasing = TRUE), ]
head(BiGramas,20)
# Bigrams are saved to a CSV file so they do not have to be generated again
write.table(BiGramas, file = "Bigramas.csv", sep = ",", row.names = FALSE, col.names = TRUE)
TwoGram <- read.csv("Bigramas.csv", header=TRUE, sep=",")
# Each bigram is split into two words to facilitate lookup
Two <- read.table(text = as.character(TwoGram$ngram), sep = " ", colClasses = "character")
Two<-cbind(Two,TwoGram$frequency)
names(Two) <- c("Word1","Word2","frequency")
g <- ggplot(TwoGram[1:40,], aes(x = reorder(ngram, frequency), y = frequency, fill = frequency)) +
     geom_bar(stat = "identity") + coord_flip() +
     theme(legend.title = element_blank()) +
     theme(plot.title = element_text(size = 9)) +
     theme(axis.title = element_text(size = 8)) +
     theme(axis.text = element_text(size = 7)) +
     xlab("2-gram") + ylab("Frequency") +
     labs(title = "Top 40 Bigrams")
print(g)
TriGramas <- nGramas(TriGram, low=300,high=Inf)
TriGramas <- TriGramas[order(TriGramas$frequency, decreasing = TRUE), ]
# Trigrams are saved to a CSV file so they do not have to be generated again
write.table(TriGramas, file = "Trigramas.csv", sep = ",", row.names = FALSE, col.names = TRUE)
ThreeGram <- read.csv("Trigramas.csv", header=TRUE, sep=",")
# Each trigram is split into three words to facilitate lookup
Three <- read.table(text = as.character(ThreeGram$ngram), sep = " ", colClasses = "character")
Three<-cbind(Three,ThreeGram$frequency)
names(Three) <- c("Word1","Word2","Word3","frequency")
g <- ggplot(ThreeGram[1:40,], aes(x = reorder(ngram, frequency), y = frequency, fill = frequency)) +
     geom_bar(stat = "identity") + coord_flip() +
     theme(legend.title = element_blank()) +
     theme(plot.title = element_text(size = 9)) +
     theme(axis.title = element_text(size = 8)) +
     theme(axis.text = element_text(size = 7)) +
     xlab("3-gram") + ylab("Frequency") +
     labs(title = "Top 40 Trigrams")
print(g)
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.12.3 ggplot2_2.0.0 RWeka_0.4-26
## [4] qdap_2.2.4 RColorBrewer_1.1-2 qdapTools_1.3.1
## [7] qdapRegex_0.6.0 qdapDictionaries_1.0.6 tm_0.6-2
## [10] NLP_0.1-8 pander_0.5.2 stringi_1.0-1
##
## loaded via a namespace (and not attached):
## [1] gtools_3.5.0 wordcloud_2.5 venneuler_1.1-0
## [4] slam_0.1-32 reshape2_1.4.1 rJava_0.9-6
## [7] reports_0.1.4 colorspace_1.2-6 htmltools_0.3
## [10] yaml_2.1.13 chron_2.3-45 XML_3.98-1.2
## [13] DBI_0.3.1 plyr_1.8.3 stringr_1.0.0
## [16] munsell_0.4.2 gtable_0.1.2 evaluate_0.8
## [19] labeling_0.3 RWekajars_3.7.13-1 gender_0.5.1
## [22] parallel_3.2.0 xlsxjars_0.6.1 Rcpp_0.12.2
## [25] scales_0.3.0 formatR_1.2.1 gdata_2.17.0
## [28] plotrix_3.6-1 xlsx_0.5.7 openNLPdata_1.5.3-2
## [31] gridExtra_2.0.0 digest_0.6.9 dplyr_0.4.3
## [34] grid_3.2.0 tools_3.2.0 bitops_1.0-6
## [37] magrittr_1.5 RCurl_1.95-4.7 data.table_1.9.6
## [40] assertthat_0.1 rmarkdown_0.8 openNLP_0.2-6
## [43] R6_2.1.1 igraph_1.0.1