Introduction

There are three files containing text: blog posts, news articles, and tweets. In this document a common exploratory analysis of these files is described. We will also create 2-grams (sequences of two words) for each file.

Packages and libraries

We will use the tm (https://cran.r-project.org/web/packages/tm/tm.pdf) and RWeka (https://cran.r-project.org/web/packages/RWeka/RWeka.pdf) packages for text mining, plus stringr for word counting and ggplot2 for plotting.

#function that loads a package, installing it first if it is missing
loadpackage <- function (name) 
{
        if (!require(name, character.only = T)){
                install.packages(name)
                library(package = name, character.only = T)
        }   
}

loadpackage("tm")
loadpackage("RWeka")
loadpackage("stringr")
loadpackage("ggplot2")

Functions

#Function for creating n-grams matrix
tdmGeneration <- function(text, n){
        
        #create corpus (VCorpus, so that the custom tokenizer below is honoured)
        corpus <- VCorpus(VectorSource(text)) 
        
        #transform to lower case
        corpus <- tm_map(corpus, content_transformer(tolower))
        #remove numbers (digits)
        corpus <- tm_map(corpus, removeNumbers) 
        #remove punctuation (,.$: etc.)
        corpus <- tm_map(corpus, removePunctuation)
        #collapse repeated whitespace
        corpus <- tm_map(corpus, stripWhitespace)
        #remove stopwords (a,the etc)
        corpus <- tm_map(corpus, removeWords, stopwords("english")) 

        # create n-grams
        NgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n)) 
        
        # create n-grams matrix
        tdm <- TermDocumentMatrix(corpus, control = list(tokenize = NgramTokenizer)) 
        
        return(tdm)
}
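
A quick sanity check of this function on a couple of invented sentences (a minimal sketch; the sentences are made up for illustration):

toy_text <- c("i love dogs and i love cats",
              "dogs love long walks in the park")
toy_tdm <- tdmGeneration(toy_text, 2)
inspect(toy_tdm)   # lists bigrams such as "love dogs" with their per-document counts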


#function for text reading
ReadTextFromFile <- function(filename){
        file <- file(filename, "r")
        str <- readLines(file, skipNul = TRUE)
        close(file)
        
        return(str)
}

    
#return the set name together with its line and word counts (in millions)
getTextSummary <- function(setname, text){
        rc <- round(length(text)/1e6, 3)
        wc <- round(sum(str_count(text, "\\S+"))/1e6, 3)
        c(as.character(setname), rc, wc)
}

#take an n% random sample of lines from the text
getTextSample <- function(str,n){
        str <- str[sample(length(str),round(length(str)*n/100))]
        return(str)
}


#plot the 15 most frequent n-grams from a term-document matrix
drawBarPlot <- function(tdm, setName){

        tdm.matrix <- as.matrix(tdm)
        topwords <- data.frame(words=rownames(tdm.matrix), freq =  rowSums(tdm.matrix))
        rownames(topwords) <- NULL
        topwords <- topwords[order(topwords$freq, decreasing = T),] 
        topwords$words <- reorder(topwords$words,topwords$freq)
        ggplot(head(topwords,15), aes(y = freq, x=words))+
                geom_bar(stat="identity")+
                coord_flip()+
                theme_bw()+
                xlab("2-grams")+
                ylab("Frequency")+
                ggtitle(setName)
}

Summary

Data reading

Original dataset is available here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
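
If the archive has not been downloaded yet, something like the following sketch can fetch and unpack it (the final/en_US/ paths inside the zip are an assumption worth verifying after extraction):

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
        download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
        unzip("Coursera-SwiftKey.zip")   # the English files are expected under final/en_US/
}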

Read the files and their content. Don't forget to set your working directory to the folder containing the files!

tw_str <- ReadTextFromFile("en_US.twitter.txt")
nw_str <- ReadTextFromFile("en_US.news.txt")
bl_str <- ReadTextFromFile("en_US.blogs.txt")

File summary

Now we can see a summary of each file.

tw_sum <- getTextSummary("Twitter", tw_str)
nw_sum <- getTextSummary("News", nw_str)
bl_sum <- getTextSummary("Blogs", bl_str)

df <- data.frame(rbind(tw_sum, nw_sum, bl_sum), stringsAsFactors = FALSE)
names(df) <- c("SetName", "LinesCount", "WordsCount")
rownames(df) <- NULL

#ggplot(df, aes(x = SetName, y = as.numeric(LinesCount))) + geom_bar(stat = "identity") + xlab("Source") + ylab("Lines count (10^6)") + theme_bw()

ggplot(df, aes(x = SetName, y = as.numeric(WordsCount))) + geom_bar(stat = "identity") + xlab("Source") + ylab("Words count (10^6)") + theme_bw()

We can see that the “News” set has a much smaller word count. This is caused by an error when reading the source file; this bug must be fixed in the future.
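
One possible workaround (a sketch, assuming the truncation comes from a special character that stops readLines when the file is opened in text mode) is to read the file in binary mode:

con <- file("en_US.news.txt", "rb")          # "rb" instead of "r"
nw_str <- readLines(con, skipNul = TRUE)     # reads past the problematic character
close(con)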

N-grams

These files are very large and contain a very large set of words, so we must reduce them. We will take only a 1% sample of lines from each:

set.seed(42)
tw_sample <- getTextSample(tw_str,1)
nw_sample <- getTextSample(nw_str,1)
bl_sample <- getTextSample(bl_str,1)

Now we can create 2-grams.

tw_tdm <- tdmGeneration(tw_sample, 2)
nw_tdm <- tdmGeneration(nw_sample, 2)
bl_tdm <- tdmGeneration(bl_sample, 2)

Reduce sparsity:

tw_tdm1 <- removeSparseTerms(tw_tdm, 0.9999)
nw_tdm1 <- removeSparseTerms(nw_tdm, 0.9999)
bl_tdm1 <- removeSparseTerms(bl_tdm, 0.999)
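
To check how much the matrices shrink, we can compare their dimensions before and after (a small sketch):

dim(tw_tdm)    # terms x documents before removing sparse terms
dim(tw_tdm1)   # noticeably fewer terms afterwards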

We can see the most popular 2-grams for each file. Twitter:

#findFreqTerms(tw_tdm, lowfreq = 50)
drawBarPlot(tw_tdm1, "Twitter")

drawBarPlot(nw_tdm1, "News")

drawBarPlot(bl_tdm1, "Blogs")

Future plans

First of all, it will be useful to create tables with the frequency of each n-gram (for n = 1, 2, 3).
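
One possible way to build such a table from a term-document matrix (a minimal sketch reusing tdmGeneration from above; the helper name freqTable is ours):

freqTable <- function(tdm){
        m <- as.matrix(tdm)
        df <- data.frame(ngram = rownames(m), freq = rowSums(m), stringsAsFactors = FALSE)
        rownames(df) <- NULL
        df[order(df$freq, decreasing = TRUE), ]
}

bigram_freq <- freqTable(tw_tdm1)
head(bigram_freq, 5)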

For predicting a word from its first few letters we will use the unigram table. For predicting the second word we will use the bigram table, and for predicting the third word the 3-gram table. In these tables we will look up the most popular continuation of the entered words.
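
For instance, predicting the next word from the bigram table could look roughly like this (a sketch; it assumes the bigram_freq table built above with space-separated n-grams, and the same idea extends to the 3-gram table):

predictNext <- function(word, bigram_freq){
        #keep bigrams whose first word matches the entered word
        candidates <- bigram_freq[startsWith(bigram_freq$ngram, paste0(word, " ")), ]
        if (nrow(candidates) == 0) return(NA_character_)
        #return the second word of the most frequent matching bigram
        strsplit(candidates$ngram[1], " ")[[1]][2]
}

predictNext("love", bigram_freq)   # might return "you" if that bigram is frequent in the sample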

The result may change depending on the letters and words entered. For example (a sketch of this letter-based refinement follows the list):

  1. If the user types “I love”,
  2. we see that the most popular third word is “you”,
  3. so we offer it to the user.
  4. But if the user begins typing the third word and its first letter is “d”,
  5. we withdraw our suggestion and look for the new most popular word.
  6. In this case it would probably be “dogs”, so we offer that to the user.
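
A rough sketch of that letter-based filtering (assuming the same bigram_freq table as above; for brevity it works on bigrams, but the same idea applies to the 3-gram table from the example):

predictNextWithPrefix <- function(word, prefix, bigram_freq){
        #bigrams that start with the entered word
        candidates <- bigram_freq[startsWith(bigram_freq$ngram, paste0(word, " ")), ]
        if (nrow(candidates) == 0) return(NA_character_)
        #second word of each candidate bigram
        second <- sapply(strsplit(candidates$ngram, " "), `[`, 2)
        #keep only words that start with the typed prefix
        second <- second[startsWith(second, prefix)]
        if (length(second) == 0) return(NA_character_)
        second[1]          # candidates are already ordered by frequency
}

predictNextWithPrefix("love", "d", bigram_freq)   # may return "dogs" if present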