The goal of this project is to build a predictive text-mining application that predicts the next word from the preceding words. In this report, I show how I downloaded and loaded the datasets, present some of their general characteristics, and explain how I plan to approach the problem, in order to get your valuable feedback.
We load the data using the following code. The data is available here.
The file “en_US.news.txt” needs special care while loading, as it may otherwise not be read completely.
# Read the blogs and Twitter files line by line (skipNul avoids problems with embedded nul characters in the Twitter file)
blogs<-readLines("en_US.blogs.txt")
twitter<-readLines("en_US.twitter.txt",skipNul=TRUE)
# Open the news file in binary mode so readLines can read it completely
con <- file("en_US.news.txt", open="rb")
news <- readLines(con, encoding="UTF-8")
close(con)
rm(con)
# File sizes in MB
s1<-file.size("en_US.blogs.txt")/1024/1024
s2<-file.size("en_US.news.txt")/1024/1024
s3<-file.size("en_US.twitter.txt")/1024/1024
size<-rbind(s1,s2,s3)
# Number of lines in each file
library(R.utils)
n1<-countLines("en_US.blogs.txt")
n2<-countLines("en_US.news.txt")
n3<-countLines("en_US.twitter.txt")
n<-rbind(n1,n2,n3)
library(magrittr)
library(stringi)
# Count the words on every line of each dataset
blogsWrds<-stri_count_words(blogs)
twitterWrds<-stri_count_words(twitter)
newsWrds<-stri_count_words(news)
wrds<-rbind(summary(blogsWrds),summary(newsWrds),summary(twitterWrds))
wrds<-as.data.frame(wrds)
wrds<-wrds[,c(3,6)]
totblogs<-sum(blogsWrds);tottwitter<-sum(twitterWrds);totnews<-sum(newsWrds);tot<-rbind(totblogs,totnews,tottwitter)
wrds$tot<-tot
wrds$size<-size
rownames(wrds)<-c("Blogs","News","Twitter")
colnames(wrds)<-c("Mdn W/L", "Max W/L", "Tot W","File size (MB)")
The following table shows that the median number of words per line was highest in the News dataset, while the maximum number of words per line occurred in the Blogs dataset. The total number of words was lowest in the Twitter dataset; the News dataset has around 4,500,000 more words, and the Blogs dataset around 8,000,000 more.
|  | Mdn W/L | Max W/L | Tot W | File size (MB) |
|---|---|---|---|---|
| Blogs | 29 | 6,726 | 38,154,238 | 200.4242 |
| News | 32 | 1,796 | 34,762,396 | 196.2775 |
| Twitter | 12 | 60 | 30,218,166 | 159.3641 |
library(magrittr)
set.seed(1412)
# Sample 0.05% of the lines from each dataset
sampleBlogs<-blogs[sample(n1,size=0.0005*n1)]
sampleNews<-news[sample(n2,size=0.0005*n2)]
sampleTwitter<-twitter[sample(n3,size=0.0005*n3)]
# Combine the three samples into a single character vector
dat<-c(sampleBlogs,sampleNews,sampleTwitter) %>% unlist %>% as.character
save(dat,file="sample.RData")
I sampled 0.05% of every dataset. Thus, 449, 505 and 1180 lines were extracted from the Blogs, News and Twitter datasets respectively. Then, I combined the three samples into one set, dat, of 2134 lines.
From this link, I downloaded a list of profanity words to delete them from the term-document matrices. I did not remove them at the beginning because removing them too early may affect prediction. For example, ‘You are “banned word” man’ is a sentence that contains a profanity word. If I remove the profanity word before tokenization, this sentence becomes ‘You are man’; after tokenization, “man” may then be predicted after “You are”. This prediction would be wrong. Thus, I decided to remove profanity words after tokenization: in this way, any n-gram such as “You are ‘banned word’” is removed as an entire block.
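As a rough sketch of this idea (a toy example, not the final pipeline; the placeholder “badword” stands in for an actual entry of the blocked-terms list), whole n-grams containing a blocked term can be dropped after tokenization like this:
# Toy example: drop every n-gram that contains a blocked term
ngrams<-c("you are badword","you are nice","have a badword day")
profanity<-c("badword")
keep<-!sapply(ngrams,function(x) any(sapply(profanity,grepl,x,ignore.case=TRUE)))
ngrams[keep]   # only "you are nice" survives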
Instead of dealing with a “line” as the unit, I thought a “sentence” would be a better choice. Take the text “I will go to the park. This is a good day!”. If I treat it as one line, my model may predict “this” after “the park”. That would be an inaccurate prediction, as “this” is the beginning of a new sentence and should not be predicted from “the park”. The following code breaks the whole sample into `r length(dat_sent)` sentences instead of `r length(dat)` lines.
library(openNLP)
library(NLP)
library(magrittr)
library(dplyr)
library(qdap)
library(stringr)
#First, convert the character vector into a single string, then remove escaped characters and trim whitespace
s<-as.String(dat)
s<-as.String(str_trim(clean(s)))
#Second, detect sentence boundaries with the openNLP maximum-entropy sentence detector
sent_token_annotator <- Maxent_Sent_Token_Annotator()
a1 <- annotate(s, sent_token_annotator)
## Extract sentences
dat_sent<-s[a1]
Then, I removed everything except letters, digits and spaces, as the app will predict the next “word”.
dat_sent <- gsub("[^a-zA-Z0-9 ]", "", dat_sent)
save(dat_sent,file="cleansentences.RData")
Tokenization means breaking a stream of text up into words and phrases. I prepared every sentence to be tokenized three ways: into single words (unigrams), pairs of words (bigrams) and triples of words (trigrams).
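As a quick illustration (a small sketch assuming the RWeka package that is used below), this is how a single sentence breaks into 1-, 2- and 3-grams:
library(RWeka)
ex<-"I will go to the park"
NGramTokenizer(ex, Weka_control(min=1,max=1))   # "I" "will" "go" "to" "the" "park"
NGramTokenizer(ex, Weka_control(min=2,max=2))   # "I will" "will go" "go to" "to the" "the park"
NGramTokenizer(ex, Weka_control(min=3,max=3))   # "I will go" "will go to" "go to the" "to the park"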
library(tm)
library(RWeka)
library(magrittr)
library(dplyr)
### forming corpus and removing unnecessary white spaces
tm_dat <- dat_sent %>%
as.data.frame %>%
DataframeSource %>%
VCorpus %>%
tm_map( stripWhitespace )
# Save the corpus so the tokenization chunks below can reload it
save(tm_dat,file="beforeTokenization.RData")
rm(dat_sent)
### Unigrams
# I tokenized the sentences into one-gram tokens after removing punctuation and numbers; then I converted the term-document matrix into a regular matrix
load("beforeTokenization.RData")
library(tm)
library(magrittr)
library(dplyr)
library(ggplot2)
tdm1<-TermDocumentMatrix(tm_dat,control = list(
removePunctuation = TRUE,
removeNumbers=TRUE,
wordLengths=c(0, Inf))) %>% as.matrix
rm(tm_dat)
## Removing profanity words and empty spaces
profanity<-read.csv("Terms-to-Block.csv",skip=4)[,2]
profanity<-gsub(",","",profanity)
profanity<-c(profanity,"^$")
strings1<-rownames(tdm1)
# Flag every unigram that matches a blocked term (or is empty) and drop it
matches1 <- sapply(profanity, grepl, strings1, ignore.case=TRUE)
selection1<-apply(matches1,1,any)
tdm1<-tdm1[!selection1,]
### Creating word frequency table from Unigrams
wft1<-tdm1 %>% as.data.frame %>%
mutate(Terms=rownames(tdm1),Freq=rowSums(.)) %>%
arrange(desc(Freq)) %>%
mutate(Cum=cumsum(Freq)/sum(Freq)) %>%
select(Terms,Freq,Cum)
# Unigrams that together account for around 90% of all word occurrences
commonwrd1<-wft1 %>%
filter(Cum<=0.90)
Around 436 unigrams were removed because they matched profanity terms. Of the remaining 10330 unique unigrams, 5393 (around 52.21% of them) account for 90% of all word occurrences.
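As a quick sanity check on these figures (a small sketch that reuses the objects created above), the quoted counts can be reproduced directly:
# Sanity check of the figures quoted above
sum(selection1)                              # ~436 unigrams flagged as profanity or empty
nrow(wft1)                                   # ~10330 unique unigrams remaining
nrow(commonwrd1)                             # ~5393 unigrams cover 90% of all occurrences
round(100*nrow(commonwrd1)/nrow(wft1),2)     # ~52.21%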
| Terms | Freq | Cum |
|---|---|---|
| the | 2,277 | 0.05 |
| to | 1,408 | 0.07 |
| and | 1,201 | 0.10 |
| a | 1,169 | 0.12 |
| of | 979 | 0.14 |
| in | 820 | 0.16 |
| i | 720 | 0.17 |
|  | 621 | 0.19 |
| that | 537 | 0.20 |
| is | 528 | 0.21 |
### Bigrams
load("beforeTokenization.RData")
library(tm);library(magrittr);library(dplyr);library(RWeka)
token2 <- function(x) NGramTokenizer(x, Weka_control(min = 2L, max = 2L))
# I tokenized the sentences into two-gram tokens after removing punctuation and numbers; then I converted the term-document matrix into a regular matrix
tdm2<-TermDocumentMatrix(tm_dat,control = list(
removePunctuation = TRUE,
removeNumbers=TRUE,
wordLengths=c(0, Inf),
tokenize=token2))%>% as.matrix
rm(tm_dat)
## Removing profanity words and empty spaces
profanity<-read.csv("Terms-to-Block.csv",skip=4)[,2]
profanity<-gsub(",","",profanity)
profanity<-c(profanity,"^$")
strings2<-rownames(tdm2)
# Flag every bigram that contains a blocked term (or is empty) and drop it
matches2 <- sapply(profanity, grepl, strings2, ignore.case=TRUE)
selection2<-apply(matches2,1,any)
tdm2<-tdm2[!selection2,]
### Creating word frequency table from bigrams
wft2<-tdm2 %>% as.data.frame %>%
mutate(Terms=rownames(tdm2),Freq=rowSums(.)) %>%
arrange(desc(Freq)) %>%
mutate(Cum=cumsum(Freq)/sum(Freq)) %>%
select(Terms,Freq,Cum)
# Bigrams that together account for around 80% of all bigram occurrences
commonwrd2<-wft2 %>%
filter(Cum<=0.8)
Around 1374 bigrams were removed because they contained profanity terms. Of the remaining 34134 unique bigrams, 24879 (around 73% of them) account for 80% of all bigram occurrences.
| Terms | Freq | Cum |
|---|---|---|
| in the | 214 | 0.00 |
| of the | 203 | 0.01 |
| on the | 112 | 0.01 |
| to the | 109 | 0.01 |
| to be | 90 | 0.02 |
| for the | 89 | 0.02 |
| at the | 77 | 0.02 |
| and the | 64 | 0.02 |
| in a | 64 | 0.02 |
| it was | 51 | 0.02 |
load("beforeTokenization.RData")
library(tm);library(magrittr);library(dplyr);library(RWeka)
token3 <- function(x) NGramTokenizer(x, Weka_control(min = 3L, max = 3L))
# I tokenized the sentences into three-gram tokens after removing punctuation and numbers; then I converted the term-document matrix into a regular matrix
tdm3<-TermDocumentMatrix(tm_dat,control = list(
removePunctuation = TRUE,
removeNumbers=TRUE,
wordLengths=c(0, Inf),
stripWhitespace=T,
tokenize=token3)) %>% as.matrix
rm(tm_dat)
## Removing profanity words and empty spaces
profanity<-read.csv("Terms-to-Block.csv",skip=4)[,2]
profanity<-gsub(",","",profanity)
profanity<-c(profanity,"^$")
strings3<-rownames(tdm3)
# Flag every trigram that contains a blocked term (or is empty) and drop it
matches3 <- sapply(profanity, grepl, strings3, ignore.case=TRUE)
selection3<-apply(matches3,1,any)
tdm3<-tdm3[!selection3,]
### Creating word frequency table from trigrams
wft3<-tdm3 %>% as.data.frame %>%
mutate(Terms=rownames(tdm3),Freq=rowSums(.)) %>%
arrange(desc(Freq)) %>%
mutate(Cum=cumsum(Freq)/sum(Freq)) %>%
select(Terms,Freq,Cum)
# Trigrams that together account for around 80% of all trigram occurrences
commonwrd3<-wft3 %>%
filter(Cum<=0.8)
Around 1983 trigrams were removed because they contained profanity terms. Of the remaining 41613 unique trigrams, 32963 (around 79.21% of them) account for 80% of all trigram occurrences.
| Terms | Freq | Cum |
|---|---|---|
| to be a | 20 | 0 |
| to | 15 | 0 |
| some of the | 13 | 0 |
| a lot of | 12 | 0 |
| looking forward to | 12 | 0 |
| one of the | 12 | 0 |
| going to be | 11 | 0 |
| thanks for the | 11 | 0 |
|  | 10 | 0 |
| i want to | 9 | 0 |
I think I am still in the pre-processing stage: the 9th most common trigram is an empty string, so I have to check the tokenization step or the way I extract profanity words. Next, I am going to add 4-gram tokenization. Then, I will implement a Markov-chain model to predict the next word from the previous three, two or one words, and I will use a smoothing method. The app will suggest common words such as “the”, “I” and “Hi” when no input words are available for prediction.
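As a rough sketch of the planned back-off step (assuming the frequency tables wft1, wft2 and wft3 built above; the helper name predict_next is hypothetical and smoothing is not included), the prediction could work like this:
# Minimal back-off sketch: try the trigram table with the last two words,
# fall back to the bigram table with the last word, then to the top unigram
predict_next <- function(phrase) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(words)
  if (n >= 2) {
    prefix <- paste(words[n-1], words[n])
    hits <- wft3$Terms[startsWith(wft3$Terms, paste0(prefix, " "))]
    if (length(hits) > 0) return(sub(".* ", "", hits[1]))   # tables are sorted by frequency
  }
  if (n >= 1) {
    hits <- wft2$Terms[startsWith(wft2$Terms, paste0(words[n], " "))]
    if (length(hits) > 0) return(sub(".* ", "", hits[1]))
  }
  wft1$Terms[1]   # default suggestion when nothing matches
}
predict_next("looking forward")   # likely "to", given the trigram table above
This only illustrates the back-off order; smoothing and the 4-gram level will be added on top of it.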