There are three files containing text: blog posts, news articles and tweets. In this document a basic exploratory analysis of these files is described. We will also create 2-grams (sequences of two words) for each file.
We will use the tm (https://cran.r-project.org/web/packages/tm/tm.pdf) and RWeka (https://cran.r-project.org/web/packages/RWeka/RWeka.pdf) packages for text mining.
#function for loading (and installing, if necessary) a package
loadpackage <- function (name)
{
  if (!require(name, character.only = T)){
    install.packages(name)
    library(package = name, character.only = T)
  }
}
loadpackage("tm")
loadpackage("RWeka")
loadpackage("stringr")
loadpackage("ggplot2")
#Function for creating an n-gram term-document matrix
tdmGeneration <- function(text, n){
  #create corpus
  corpus <- Corpus(VectorSource(text))
  #transform to lower case
  corpus <- tm_map(corpus, content_transformer(tolower))
  #remove numbers (digits)
  corpus <- tm_map(corpus, removeNumbers)
  #remove punctuation (,.$: etc.)
  corpus <- tm_map(corpus, removePunctuation)
  #collapse extra whitespace
  corpus <- tm_map(corpus, stripWhitespace)
  #remove stopwords (a, the etc.)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  #n-gram tokenizer
  NgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
  #create n-gram term-document matrix
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = NgramTokenizer))
  return(tdm)
}
#function for reading text from a file
ReadTextFromFile <- function(filename){
  file <- file(filename, "r")
  str <- readLines(file, skipNul = TRUE)
  close(file)
  return(str)
}
#get a summary of a text set: line and word counts, in millions
getTextSummary <- function(setname, text){
  rc <- round(length(text)/1e6, 3)
  wc <- round(sum(str_count(text, "\\S+"))/1e6, 3)
  c(as.character(setname), rc, wc)
}
#get an n% random sample of the text lines
getTextSample <- function(str, n){
  str <- str[sample(length(str), round(length(str)*n/100))]
  return(str)
}
drawBarPlot <- function(tdm, setName){
  tdm.matrix <- as.matrix(tdm)
  topwords <- data.frame(words = rownames(tdm.matrix), freq = rowSums(tdm.matrix))
  rownames(topwords) <- NULL
  topwords <- topwords[order(topwords$freq, decreasing = T),]
  topwords$words <- reorder(topwords$words, topwords$freq)
  ggplot(head(topwords, 15), aes(y = freq, x = words))+
    geom_bar(stat = "identity")+
    coord_flip()+
    theme_bw()+
    xlab("2-grams")+
    ylab("Frequency")+
    ggtitle(setName)
}
The original dataset is available here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Read the files and their content. *Don't forget to change your working directory!
tw_str <- ReadTextFromFile("en_US.twitter.txt")
nw_str <- ReadTextFromFile("en_US.news.txt")
bl_str <- ReadTextFromFile("en_US.blogs.txt")
Now we can see a summary of each file.
tw_sum <- getTextSummary("Twitter", tw_str)
nw_sum <- getTextSummary("News", nw_str)
bl_sum <- getTextSummary("Blogs", bl_str)
df <- data.frame(rbind(tw_sum, nw_sum, bl_sum), stringsAsFactors = FALSE)
names(df) <- c("SetName", "LinesCount", "WordsCount")
rownames(df) <- NULL
#ggplot(df, aes(x = SetName, y = as.numeric(LinesCount))) + geom_bar(stat = "identity") + xlab("Source") + ylab("Lines count (10^6)") + theme_bw()
ggplot(df, aes(x = SetName, y = as.numeric(WordsCount))) + geom_bar(stat = "identity") + xlab("Source") + ylab("Words count (10^6)") + theme_bw()
We can see that “news” has a much lower word count. This is caused by an error when reading the source file; this bug should be fixed in the future.
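One possible workaround (an assumption, not applied in the analysis above) is that readLines stops early at an embedded SUB (EOF) character, which can be avoided by opening the file in binary mode; nw_str_full below is just an illustrative name:
#possible workaround (assumption): open the file in binary mode so that
#readLines does not stop early at an embedded SUB (0x1A) character
con <- file("en_US.news.txt", open = "rb")
nw_str_full <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)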
These files, and the resulting sets of words, are very large, so we must reduce them. We will take only a random 1% sample of each:
set.seed(42)
tw_sample <- getTextSample(tw_str,1)
nw_sample <- getTextSample(nw_str,1)
bl_sample <- getTextSample(bl_str,1)
Now we can create the 2-gram term-document matrices.
tw_tdm <- tdmGeneration(tw_sample, 2)
nw_tdm <- tdmGeneration(nw_sample, 2)
bl_tdm <- tdmGeneration(bl_sample, 2)
Remove sparse terms to reduce the size of the matrices:
tw_tdm1 <- removeSparseTerms(tw_tdm, 0.9999)
nw_tdm1 <- removeSparseTerms(nw_tdm, 0.9999)
bl_tdm1 <- removeSparseTerms(bl_tdm, 0.999)
Now we can see the most popular 2-grams for each source.
#findFreqTerms(tw_tdm, lowfreq = 50)
drawBarPlot(tw_tdm1, "Twitter")
drawBarPlot(nw_tdm1, "News")
drawBarPlot(bl_tdm1, "Blogs")
First of all, it will be useful to create tables with the frequency of each n-gram (for n = 1, 2, 3).
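A minimal sketch of building such a table, reusing the tdmGeneration function above (getNgramFreq is a hypothetical helper name, not part of the code above):
#sketch: build a sorted frequency table from a term-document matrix
getNgramFreq <- function(tdm){
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  data.frame(ngram = names(freq), freq = freq, row.names = NULL, stringsAsFactors = FALSE)
}
#for example, 1-, 2- and 3-gram tables for the Twitter sample
tw_freq1 <- getNgramFreq(tdmGeneration(tw_sample, 1))
tw_freq2 <- getNgramFreq(tdmGeneration(tw_sample, 2))
tw_freq3 <- getNgramFreq(tdmGeneration(tw_sample, 3))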
To predict a word from its first few letters we will use the unigram table; to predict the second word, the bigram table; and to predict the third word, the trigram table. In each table we will look up the most frequent variant for the entered words.
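For instance, the lookup in the bigram table could be sketched as follows (predictNextWord is a hypothetical helper and assumes a frequency table built as above):
#sketch: return the most frequent word following prevWord in a bigram table
predictNextWord <- function(prevWord, freqTable){
  #keep only bigrams whose first word matches the entered word
  matches <- freqTable[grepl(paste0("^", prevWord, " "), freqTable$ngram), ]
  if (nrow(matches) == 0) return(NA_character_)
  #the table is already sorted by frequency, so take the first match
  strsplit(matches$ngram[1], " ")[[1]][2]
}
predictNextWord("right", tw_freq2)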
The result may change depending on the letters and words entered. For example,