The data were downloaded from the course website and unzipped to extract the English database as a corpus. The dataset contains news articles, blog posts and tweets in four languages: English, German, Russian and Finnish. The English corpus consists of three text files (Twitter, blogs and news), with each line standing for one message or document. Their summary statistics are:
Twitter: short lines; the longest tweet observed has 140 characters. There are about 167 million characters and 31 million words across 2.36 million tweets.
Blogs: paragraphs with multiple sentences per blog post. The longest line has 40,833 characters; in total this file has about 38 million words in just under 900,000 lines.
News: paragraphs with multiple sentences. The longest line has 11,384 characters, with about 36 million words in roughly 1 million lines.
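The total character counts quoted above are not produced by the summary function below, which reports per-line statistics; a minimal sketch of how they can be computed (assuming the unzipped files sit under final/en_US/, as in the rest of the code) is:
# Total character count per file; assumes the unzipped corpus is under final/en_US/.
files <- c("en_US.twitter.txt", "en_US.news.txt", "en_US.blogs.txt")
sapply(files, function(f) {
  lines <- readLines(file.path("final", "en_US", f), encoding = "UTF-8", warn = FALSE)
  sum(nchar(lines))
})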
library(ggplot2)    # density plots of line lengths
library(stringr)    # str_count() for sentence counting
library(tm)         # corpus handling and cleaning
library(R.utils)    # countLines()
library(RTextTools) # loaded but not used directly below
library(wordcloud)  # word cloud of frequent terms
filelist <- c("en_US.twitter.txt", "en_US.news.txt", "en_US.blogs.txt")
getsummary_file <- function(filename) {
  filepath <- paste("final//en_US//", filename, sep = "")
  con <- file(filepath, "r")
  nlines <- countLines(file = filepath)
  text.data <- readLines(con, n = nlines, encoding = "UTF-8", warn = FALSE)
  # per-line character counts and basic length statistics
  nchar.perline <- nchar(text.data)
  longest.line <- max(nchar.perline)
  shortest.line <- min(nchar.perline)
  median.line <- median(nchar.perline)
  # approximate sentence count from terminal punctuation
  n.sentences <- sum(str_count(text.data, "[\\.|?]"))
  # crude word tokenization on non-word characters
  word.list <- strsplit(text.data, "\\W+", perl = TRUE)
  words <- unlist(word.list)
  word.count <- length(words)
  unique.words <- length(unique(words))
  top10.words <- head(sort(table(words), decreasing = TRUE), 10)
  summary.text <- cbind(filename, nlines, longest.line, shortest.line,
                        median.line, n.sentences, word.count)
  print("top 10 words are: ")
  print(top10.words)
  print("file summary")
  print(summary.text)
  # density plot of characters per line
  hi.df <- data.frame(nchar = nchar.perline)
  m <- ggplot(hi.df, aes(x = nchar)) +
    ggtitle(paste("density plot number of characters per line in", filename)) +
    geom_density()
  print(m)
  close(con)
  return(text.data)
}
Twitter
tweets <- getsummary_file(filelist[1])
## [1] "top 10 words are: "
## words
## the I to a you and for in of is
## 842294 804214 770738 578042 522523 405729 373100 360568 351926 339363
## [1] "file summary"
## filename nlines longest.line shortest.line median.line
## [1,] "en_US.twitter.txt" "2360148" "140" "2" "64"
## n.sentences word.count
## [1,] "3020319" "31150908"
News
news <- getsummary_file(filelist[2])
## [1] "top 10 words are: "
## words
## the to and a of in s that for
## 1720341 898055 857242 844540 771103 633110 418779 341488 337611
## is
## 281764
## [1] "file summary"
## filename nlines longest.line shortest.line median.line
## [1,] "en_US.news.txt" "1010242" "11384" "1" "185"
## n.sentences word.count
## [1,] "2209839" "35793026"
Blogs
blogs <- getsummary_file(filelist[3])
## [1] "top 10 words are: "
## words
## the to and I of a in that is
## 1669721 1055462 1036035 889792 868442 865336 555938 459389 426408
## it
## 382723
## [1] "file summary"
## filename nlines longest.line shortest.line median.line
## [1,] "en_US.blogs.txt" "899288" "40833" "1" "156"
## n.sentences word.count
## [1,] "2294984" "38378182"
The tokenization step aims to remove meaningless characters and words that occur with very low frequency in the corpus. The final corpus will keep the words and n-grams with high frequency, which will be helpful for exploring the relationships between words and for building a meaningful statistical model.
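To make the n-gram counting concrete, here is a minimal base-R sketch of bigram counts; sample.lines is a hypothetical stand-in for any cleaned character vector, not part of the pipeline below:
# Count bigrams in a small character vector (sample.lines is a placeholder).
sample.lines <- c("this is a test", "this is another test")
tokens <- strsplit(tolower(sample.lines), "\\W+", perl = TRUE)
bigrams <- unlist(lapply(tokens, function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))  # pair each word with its successor
}))
head(sort(table(bigrams), decreasing = TRUE), 10)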
So I cleaned the data by removing non-ASCII characters, converting capital letters to lower case, removing punctuation, numbers and stop words, and stemming the remaining words. To reduce the sparsity of the term-frequency matrix, I removed the terms that occur fewer than ten times in the whole corpus (a sketch of that pruning step follows the word cloud code below). Let me first do the Twitter data:
# sample roughly 10% of the tweets to keep the corpus manageable
index <- as.logical(rbinom(n = length(tweets), size = 1, prob = 0.10))
tweets <- tweets[index]
# drop lines that contain non-ASCII characters
dat <- grep("tweets", iconv(tweets, "latin1", "ASCII", sub = "tweets"))
if (length(dat) > 0) tweets <- tweets[-dat]
# create a corpus and apply the cleaning transformations
tweets_corpus <- Corpus(VectorSource(tweets))
sc <- tm_map(tweets_corpus, removeNumbers)
sc <- tm_map(sc, removePunctuation)
sc <- tm_map(sc, content_transformer(tolower))  # wrap base functions for tm >= 0.6
sc <- tm_map(sc, removeWords, stopwords("english"))
sc <- tm_map(sc, stripWhitespace)
myCorpus <- tm_map(sc, PlainTextDocument)
wordcld <- wordcloud(myCorpus,
                     scale = c(5, 0.5),
                     max.words = 200,
                     random.order = FALSE,
                     rot.per = 0.35,
                     use.r.layout = FALSE,
                     colors = brewer.pal(8, 'Dark2'))
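The pruning of rare terms described above is not shown in the word cloud chunk; a minimal sketch of it, using tm on the myCorpus object just built (slam is installed as a dependency of tm), could look like this:
# Keep only terms occurring at least 10 times overall, as described above.
tdm <- TermDocumentMatrix(myCorpus)
freq.terms <- findFreqTerms(tdm, lowfreq = 10)   # terms with total count >= 10
term.freq <- slam::row_sums(tdm)                 # total count of every term
head(sort(term.freq[freq.terms], decreasing = TRUE), 20)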
The next steps for this project are:
Selection of the most suitable NLP algorithm and package
Accent filtering
Profanity substitution such that the context of the sentence remains intact (see the sketch after this list)
Build and experiment with the size and performance of different n-gram models
Shiny deployment
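For the profanity substitution item, one possible sketch (the badwords vector is a hypothetical placeholder for a real profanity list read from a file) replaces blacklisted words with a neutral token so the sentence structure stays intact:
# Replace blacklisted words with a placeholder token, keeping sentence context.
badwords <- c("badword1", "badword2")  # placeholder list
mask_profanity <- function(text, words, token = "<profanity>") {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, token, text, ignore.case = TRUE)
}
mask_profanity("this badword1 tweet", badwords)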