Synopsis

The goal of this project is to build a predictive text-mining application that predicts the next word from the preceding words. In this report, I show how I downloaded and loaded the datasets, present some of their general characteristics, and explain how I plan to approach the problem, in order to get your valuable feedback.

Downloading and loading data

We load the data using the following code. The data is available here.

The file “en_US.news.txt” needs special care while loading: unless it is opened in binary mode, it may not be read completely.

blogs <- readLines("en_US.blogs.txt")
twitter <- readLines("en_US.twitter.txt", skipNul = TRUE)
# Open the news file in binary mode so that embedded control characters
# do not stop readLines() early
con <- file("en_US.news.txt", open = "rb")
news <- readLines(con, encoding = "UTF-8")
close(con)
rm(con)

Summary statistics about the datasets.

library(R.utils)
library(stringi)
library(magrittr)
# File sizes in MB
s1 <- file.size("en_US.blogs.txt")/1024/1024
s2 <- file.size("en_US.news.txt")/1024/1024
s3 <- file.size("en_US.twitter.txt")/1024/1024
size <- rbind(s1, s2, s3)
# Number of lines in each file
n1 <- countLines("en_US.blogs.txt")
n2 <- countLines("en_US.news.txt")
n3 <- countLines("en_US.twitter.txt")
n <- rbind(n1, n2, n3)
# Words per line in each dataset
blogsWrds <- stri_count_words(blogs)
twitterWrds <- stri_count_words(twitter)
newsWrds <- stri_count_words(news)
wrds <- rbind(summary(blogsWrds), summary(newsWrds), summary(twitterWrds))
wrds <- as.data.frame(wrds)
# Keep only the median and maximum words per line
wrds <- wrds[, c(3, 6)]
# Total number of words in each dataset
totblogs <- sum(blogsWrds)
totnews <- sum(newsWrds)
tottwitter <- sum(twitterWrds)
tot <- rbind(totblogs, totnews, tottwitter)
wrds$tot <- tot
wrds$size <- size
rownames(wrds) <- c("Blogs", "News", "Twitter")
colnames(wrds) <- c("Mdn W/L", "Max W/L", "Tot W", "File size (MB)")

The following table shows that the median number of words per line is highest in the News dataset, while the longest line (maximum words per line) occurs in the Blogs dataset. The total number of words is lowest in the Twitter dataset; the News dataset contains around 4.5 million more words than Twitter, and the Blogs dataset around 8 million more.

Table 1. Some general characteristics of the three datasets

          Mdn W/L   Max W/L   Tot W        File size (MB)
Blogs     29        6,726     38,154,238   200.4242
News      32        1,796     34,762,396   196.2775
Twitter   12        60        30,218,166   159.3641
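
A quick arithmetic check of those gaps, using the totals from Table 1:

34762396 - 30218166  # News vs. Twitter: about 4.5 million more words
38154238 - 30218166  # Blogs vs. Twitter: about 7.9 million more words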

Sampling

library(magrittr)
set.seed(1412)
# Randomly sample 0.05% of the lines from each dataset
sampleBlogs <- blogs[sample(n1, size = 0.0005*n1)]
sampleNews <- news[sample(n2, size = 0.0005*n2)]
sampleTwitter <- twitter[sample(n3, size = 0.0005*n3)]
# Combine the three samples into a single character vector
dat <- c(sampleBlogs, sampleNews, sampleTwitter) %>% unlist %>% as.character
save(dat, file = "sample.RData")

I sampled 0.05% of the lines from every dataset. Thus, 449, 505, and 1,180 lines were extracted from the Blogs, News, and Twitter datasets, respectively. I then combined the three samples into a single set, dat, of 2,134 lines.

Cleaning the datasets of profanity words.

From this link, I downloaded a list of profanity words to delete from the term-document matrices. I did not remove them from the raw text at the beginning because they may affect prediction. For example, ‘You are “banned word” man’ is a sentence that contains a profanity word. If I removed the profanity word before tokenization, the sentence would become ‘You are man’, and after tokenization “man” could be predicted to follow “You are”. That prediction would be wrong. Thus, I decided to remove profanity words after tokenization; this way, ‘You are “banned word”’ is removed as an entire block.
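
A minimal sketch of this post-tokenization filtering, using a made-up banned word (“badword”) and a toy token list rather than the real downloaded list:

# Toy illustration: drop whole n-grams that contain a banned word
profanity <- c("badword")   # hypothetical single-entry list
ngrams <- c("you are badword", "you are nice", "have a badword day", "have a good day")
# Flag every n-gram that contains a profanity term (case-insensitive)
matches <- sapply(profanity, grepl, ngrams, ignore.case = TRUE)
keep <- !apply(as.matrix(matches), 1, any)
ngrams[keep]
# [1] "you are nice"    "have a good day"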

Breaking the whole sample into sentences.

Instead of dealing with a “line” as the unit, I decided that a “sentence” is a better choice. Consider the text “I will go to the park. This is a good day!”. If I treat it as one line, my model may predict “this” after “the park”. That would be an inaccurate prediction, because “this” begins a new sentence and should not be predicted from “the park”. The following code breaks the whole sample of 2,134 lines into individual sentences.

library(openNLP)
library(NLP)
library(magrittr)
library(dplyr)
library(qdap)
library(stringr)
# First, collapse the sampled lines into a single String and clean stray whitespace
s <- as.String(dat)
s <- as.String(str_trim(clean(s)))
# Second, detect sentence boundaries with the openNLP Maxent sentence annotator
sent_token_annotator <- Maxent_Sent_Token_Annotator()
a1 <- annotate(s, sent_token_annotator)
## Extract the individual sentences using the annotation spans
dat_sent <- s[a1]

Removing non-alphanumeric characters

Then, I removed every character that is not a letter, a digit, or a space, since the app will only ever predict a “word” (numbers are dropped later, during tokenization).

# Keep only letters, digits, and spaces
dat_sent <- gsub("[^a-zA-Z0-9 ]", "", dat_sent)
save(dat_sent, file = "cleansentences.RData")
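
Note that this regular expression also strips apostrophes, so contractions lose them; for example:

gsub("[^a-zA-Z0-9 ]", "", "I'll go to the park!")
# [1] "Ill go to the park"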

Tokenization

Tokenization means breaking a stream of text into words and phrases. I prepared every sentence to be tokenized three ways: into single words (unigrams), pairs of consecutive words (bigrams), and triples of consecutive words (trigrams).

library(tm)
library(RWeka)
library(magrittr)
library(dplyr)
### Forming the corpus and removing unnecessary white space
tm_dat <- dat_sent %>%
  as.data.frame %>%
  DataframeSource %>%
  VCorpus %>%
  tm_map(stripWhitespace)
# Save the corpus so it can be reloaded before each tokenization step
save(tm_dat, file = "beforeTokenization.RData")
rm(dat_sent)
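
For a quick illustration of what the n-gram tokenizers used below produce, here is how RWeka's NGramTokenizer splits a toy sentence (not taken from the sample):

library(RWeka)
sentence <- "i will go to the park"
NGramTokenizer(sentence, Weka_control(min = 1L, max = 1L))
# [1] "i"    "will" "go"   "to"   "the"  "park"
NGramTokenizer(sentence, Weka_control(min = 2L, max = 2L))
# [1] "i will"   "will go"  "go to"    "to the"   "the park"
NGramTokenizer(sentence, Weka_control(min = 3L, max = 3L))
# [1] "i will go"   "will go to"  "go to the"   "to the park"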

Characteristics of Unigrams

#### Unigrams
# Tokenize the sentences into unigrams after removing punctuation and numbers,
# then convert the term-document matrix into an ordinary matrix
load("beforeTokenization.RData")
library(tm)
library(magrittr)
library(dplyr)
library(ggplot2)
tdm1 <- TermDocumentMatrix(tm_dat, control = list(
  removePunctuation = TRUE,
  removeNumbers = TRUE,
  wordLengths = c(0, Inf))) %>% as.matrix
rm(tm_dat)
## Removing profanity words and empty strings
profanity <- read.csv("Terms-to-Block.csv", skip = 4)[, 2]
profanity <- gsub(",", "", profanity)
profanity <- c(profanity, "^$")
strings1 <- rownames(tdm1)
matches1 <- sapply(profanity, grepl, strings1, ignore.case = TRUE)
selection1 <- apply(matches1, 1, any)
# Drop the unigrams that matched a profanity term or were empty
tdm1 <- tdm1[!selection1, , drop = FALSE]
### Creating the word frequency table for unigrams
wft1 <- tdm1 %>% as.data.frame %>%
  mutate(Terms = rownames(tdm1), Freq = rowSums(.)) %>%
  arrange(desc(Freq)) %>%
  mutate(Cum = cumsum(Freq)/sum(Freq)) %>%
  select(Terms, Freq, Cum)
# Unigrams that together account for 90% of all word occurrences
commonwrd1 <- wft1 %>%
  filter(Cum <= 0.90)

Around 436 unigrams were removed because they matched profanity terms. Of the remaining 10,330 distinct unigrams, the 5,393 most frequent ones (around 52.21% of the distinct unigrams) account for 90% of all word occurrences.
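
The counts quoted above can be read straight off these objects (assuming the profanity filter has been applied to tdm1 as in the code above):

sum(selection1)               # unigrams flagged as profanity or empty (~436)
nrow(wft1)                    # distinct unigrams kept (10,330)
nrow(commonwrd1)              # unigrams covering 90% of occurrences (5,393)
nrow(commonwrd1)/nrow(wft1)   # ~0.5221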

Table 2. The most frequent ten unigrams

Terms   Freq    Cum
the     2,277   0.05
to      1,408   0.07
and     1,201   0.10
a       1,169   0.12
of        979   0.14
in        820   0.16
i         720   0.17
          621   0.19
that      537   0.20
is        528   0.21

Figure 1. The most common ten unigrams

Characteristics of Bigrams

### Bigrams
load("beforeTokenization.RData")
library(tm); library(RWeka); library(magrittr); library(dplyr)
# Bigram tokenizer for the term-document matrix
token2 <- function(x) NGramTokenizer(x, Weka_control(min = 2L, max = 2L))
# Tokenize the sentences into bigrams after removing punctuation and numbers,
# then convert the term-document matrix into an ordinary matrix
tdm2 <- TermDocumentMatrix(tm_dat, control = list(
  removePunctuation = TRUE,
  removeNumbers = TRUE,
  wordLengths = c(0, Inf),
  tokenize = token2)) %>% as.matrix
rm(tm_dat)
## Removing profanity words and empty strings
profanity <- read.csv("Terms-to-Block.csv", skip = 4)[, 2]
profanity <- gsub(",", "", profanity)
profanity <- c(profanity, "^$")
strings2 <- rownames(tdm2)
matches2 <- sapply(profanity, grepl, strings2, ignore.case = TRUE)
selection2 <- apply(matches2, 1, any)
# Drop the bigrams that contained a profanity term or were empty
tdm2 <- tdm2[!selection2, , drop = FALSE]
### Creating the word frequency table for bigrams
wft2 <- tdm2 %>% as.data.frame %>%
  mutate(Terms = rownames(tdm2), Freq = rowSums(.)) %>%
  arrange(desc(Freq)) %>%
  mutate(Cum = cumsum(Freq)/sum(Freq)) %>%
  select(Terms, Freq, Cum)
# Bigrams that together account for 80% of all bigram occurrences
commonwrd2 <- wft2 %>%
  filter(Cum <= 0.8)

Around 1,374 bigrams were removed because they contained profanity terms. Of the remaining 34,134 distinct bigrams, the 24,879 most frequent ones (around 73% of the distinct bigrams) account for 80% of all bigram occurrences.

Table 3. The most frequent ten bigrams

Terms      Freq   Cum
in the     214    0.00
of the     203    0.01
on the     112    0.01
to the     109    0.01
to be       90    0.02
for the     89    0.02
at the      77    0.02
and the     64    0.02
in a        64    0.02
it was      51    0.02

Figure 2. The most common ten bigrams

Characteristics of Trigrams

### Trigrams
load("beforeTokenization.RData")
library(tm); library(RWeka); library(magrittr); library(dplyr)
# Trigram tokenizer for the term-document matrix
token3 <- function(x) NGramTokenizer(x, Weka_control(min = 3L, max = 3L))
# Tokenize the sentences into trigrams after removing punctuation and numbers,
# then convert the term-document matrix into an ordinary matrix
tdm3 <- TermDocumentMatrix(tm_dat, control = list(
  removePunctuation = TRUE,
  removeNumbers = TRUE,
  wordLengths = c(0, Inf),
  stripWhitespace = TRUE,
  tokenize = token3)) %>% as.matrix
rm(tm_dat)
## Removing profanity words and empty strings
profanity <- read.csv("Terms-to-Block.csv", skip = 4)[, 2]
profanity <- gsub(",", "", profanity)
profanity <- c(profanity, "^$")
strings3 <- rownames(tdm3)
matches3 <- sapply(profanity, grepl, strings3, ignore.case = TRUE)
selection3 <- apply(matches3, 1, any)
# Drop the trigrams that contained a profanity term or were empty
tdm3 <- tdm3[!selection3, , drop = FALSE]

### Creating the word frequency table for trigrams
wft3 <- tdm3 %>% as.data.frame %>%
  mutate(Terms = rownames(tdm3), Freq = rowSums(.)) %>%
  arrange(desc(Freq)) %>%
  mutate(Cum = cumsum(Freq)/sum(Freq)) %>%
  select(Terms, Freq, Cum)
# Trigrams that together account for 80% of all trigram occurrences
commonwrd3 <- wft3 %>%
  filter(Cum <= 0.8)

Around 1,983 trigrams were removed because they contained profanity terms. Of the remaining 41,613 distinct trigrams, the 32,963 most frequent ones (around 79.21% of the distinct trigrams) account for 80% of all trigram occurrences.

Table 4. The most frequent ten trigrams

Terms                Freq   Cum
to be a              20     0
to                   15     0
some of the          13     0
a lot of             12     0
looking forward to   12     0
one of the           12     0
going to be          11     0
thanks for the       11     0
                     10     0
i want to             9     0

Figure 3. The most frequent ten trigrams

Plans for creating a prediction algorithm and Shiny app.

I think I am still in the pre-processing stage: the 9th most common trigram is an empty space, so I have to review the tokenization or the way I extract profanity words. I am also going to add 4-gram tokenization. Then, I will implement a Markov-chain (back-off) model that predicts the next word from the previous three, two, or one words, and I will apply a smoothing method. If no word is supplied as input, the app will suggest common starting words such as “the”, “I”, or “Hi”.
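
A rough sketch of the planned back-off lookup (not the final implementation, and without smoothing yet), assuming the frequency tables wft1, wft2, and wft3 built above are kept sorted by decreasing frequency; predictNext is a hypothetical helper name:

library(dplyr)
library(stringr)

# Look the context up in the trigram table first, then back off to the
# bigram table, and finally fall back to the most frequent unigram
predictNext <- function(input, wft1, wft2, wft3) {
  words <- str_split(str_trim(tolower(input)), "\\s+")[[1]]
  n <- length(words)
  # Try trigrams: match the last two words of the input
  if (n >= 2) {
    ctx <- paste(words[n - 1], words[n])
    hits <- wft3 %>% filter(str_detect(Terms, paste0("^", ctx, " ")))
    if (nrow(hits) > 0) return(word(hits$Terms[1], 3))
  }
  # Back off to bigrams: match the last word of the input
  if (n >= 1) {
    ctx <- words[n]
    hits <- wft2 %>% filter(str_detect(Terms, paste0("^", ctx, " ")))
    if (nrow(hits) > 0) return(word(hits$Terms[1], 2))
  }
  # Last resort: the single most frequent unigram
  wft1$Terms[1]
}

# Example: predictNext("one of", wft1, wft2, wft3) should return "the"
# if "one of the" is the most frequent trigram starting with "one of".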