Summary

In this report we’ll examine a few introductory aspects of text mining in R, focusing on the first steps: obtaining the data, exploration, sampling, and a bit of pre-processing.

Obtaining the data

The data used for this report is provided by Coursera and SwiftKey. It is a sample of blog, Twitter, and news texts in multiple languages. However, for this report, and due to file sizes, we’ll stick to the English documents and possibly a subset of the texts.

Exploratory analysis

To better understand what we’ll be working with, we need to consider what the data might look like: the size of the files, the number of lines of text in each, and so on. This is where we find out how “nimble” we may need to be.

##                           files size.Mbs Ave.Chars.line number.of.lines
## 1   final/en_US/en_US.blogs.txt   200.42          231.7          899288
## 2    final/en_US/en_US.news.txt   196.28            203           77259
## 3 final/en_US/en_US.twitter.txt   159.36           68.8         2360148

You can see that some of the files are quite large, so using random subsets, for now, might be a better idea for introductory analysis and model building.

Pre-processing

A little housekeeping here will help us analyze the text documents more effectively. To do this we’ll use the “tm” package, which provides a good architecture for text mining. Here is a quick snippet of our raw text sample:

## [1] "Ppl be really dealing with things in their heads like clients I've come across that hear voices & they try to harm themselves"
## [2] "There's just no stopping lucroy. catching popups and shit"                                                                    
## [3] "Isn't there something slightly sinister about three-tined forks?"                                                             
## [4] "Everyone go show some love! Can't wait to hear them play!"                                                                    
## [5] "u can still suck it lol bad outing"                                                                                           
## [6] "You must be taking the flavor train to flavor town."

Here it is again after a little processing:

## [1] "ppl really dealing things heads like clients ive come across hear voices try harm "
## [2] "theres just stopping lucroy catching popups shit"                                  
## [3] "isnt something slightly sinister threetined forks"                                 
## [4] "everyone go show love cant wait hear play"                                         
## [5] "u can still suck lol bad outing"                                                   
## [6] " must taking flavor train flavor town"

I won’t go into too much detail about the benefits and drawbacks of text manipulation. The main purpose here is to get rid of common words (or even parts of words), extra characters, and maybe even numbers… but let’s stay on track and not get too caught up in the theory yet. Instead, let’s take a look at some of the most common words in our Twitter snippet after some processing.

Words like “it”, “the”, “a”, etc. are very common but not very useful, so it’s probably a good idea that we got rid of them in this example. Our next step should be clear at this point: word combinations, or n-grams, and their frequencies as well. These n-grams will most likely be key to developing a good working prediction algorithm. Let’s take a look using the RWeka package (code in the appendix).
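To make that motivation concrete, here is a toy sketch of how bigram counts could feed a next-word lookup. Everything in it is made up for illustration (the example sentences and the predict_next name are hypothetical); it is a sketch of the idea, not the final algorithm.

# Toy illustration only: a tiny hand-made "corpus" and a naive lookup.
sentences <- c("thanks for the follow", "thanks for the support",
               "looking forward to the weekend")

# split each sentence into words and form its bigrams
bigrams <- unlist(lapply(strsplit(tolower(sentences), "\\s+"), function(w){
        paste(head(w, -1), tail(w, -1))
}))
counts <- sort(table(bigrams), decreasing = TRUE)

# given a word, return the most frequent word observed right after it
predict_next <- function(word){
        hits <- counts[grepl(paste0("^", word, " "), names(counts))]
        if (length(hits) == 0) return(NA_character_)
        sub(paste0("^", word, " "), "", names(hits)[1])
}
predict_next("the")   # ties in this toy data are broken by table order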

Conclusion and looking forward

This seems like a good place to finish the report. We’re aiming for a crisp, reader-friendly write-up; if you want the nitty-gritty, the code is in the appendix. Looking forward, we could add the “wordnet” library for synonyms (a minimal sketch appears at the end of the appendix). Doing so could help account for unknown or new words and, as an overall objective, lead to a leaner, more robust algorithm.
And as always, thanks for looking.

Appendix and code

Goals of this report

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Downloading data

fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("./Coursera-SwiftKey.zip")) {
        # download the zip into the working directory and extract it
        mydir <- paste0(getwd(), "/", "Coursera-SwiftKey.zip")
        download.file(fileUrl, destfile = mydir)
        unzip("./Coursera-SwiftKey.zip")
}

Exploratory

file_list <- list.files(path = "final",      # data directory
                        recursive = TRUE,    # search subdirectories
                        pattern = "en_")     # English files only

# for each file: size in MB, average characters per line, and line count
fList <- lapply(paste0("final", "/", file_list), function(f){
        fsizeMbs <- file.size(f)/1024/1024
        con <- file(f, open = "r")
        # note: reading in text mode can stop early if a file contains
        # embedded control characters (the news file is known for this);
        # opening with "rb" avoids it
        Lines <- readLines(con)
        avChars <- mean(nchar(Lines))        # average characters per line
        close(con)
        return(c(f, round(fsizeMbs, 2), round(avChars, 2), length(Lines)))
    })

df <- data.frame(matrix(unlist(fList), nrow = length(fList), byrow = TRUE))
colnames(df) <- c("files", "size.Mbs", "Ave.Chars.line", "number.of.lines")
print(df)
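One caveat on the block above: c() coerces every element to character, so the “numeric” columns of df are actually strings (which is why Ave.Chars.line prints inconsistently in the summary table). If that matters downstream, a small fix:

# convert the coerced character columns back to numeric
df[, 2:4] <- lapply(df[, 2:4], as.numeric)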

Sample

if (!file.exists("sampleData/twitSample.txt")){
        if (!dir.exists("sampleData")) dir.create("sampleData")
        conT <- file("final/en_US/en_US.twitter.txt", open = "r")
        twitLines <- readLines(conT)
        numTwitLines <- length(twitLines)
        set.seed(343)                        # reproducible sample
        # keep a random 10% of the tweets, without replacement
        twitSample <- twitLines[sample(1:numTwitLines, numTwitLines * .1,
                                       replace = FALSE)]
        close(conT)

        # drop non-ASCII characters before writing the sample to disk
        dstFileT <- file("sampleData/twitSample.txt")
        twitSample <- iconv(twitSample, "latin1", "ASCII", sub = "")
        writeLines(twitSample, dstFileT)
        close(dstFileT)
} else {
        conT <- file("sampleData/twitSample.txt", open = "r")
        twitSample <- readLines(conT)
        close(conT)
}

tm package processing

library(tm)
sc <- VCorpus(DirSource("sampleData"))                 # build a corpus from the sample
sc <- tm_map(sc, removePunctuation)                    # strip punctuation
sc <- tm_map(sc, content_transformer(tolower))         # lower-case everything
sc <- tm_map(sc, removeWords, stopwords("english"))    # drop common stop words
sc <- tm_map(sc, stripWhitespace)                      # collapse extra whitespace
head(sc[[1]]$content)
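The pre-processing discussion also mentioned possibly dropping numbers; tm ships a removeNumbers transformation for exactly that, should we want it:

# optional: strip digits as well
sc <- tm_map(sc, removeNumbers)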

Words and plot

library(ggplot2)
dtm <- DocumentTermMatrix(sc)
freqWords <- colSums(as.matrix(dtm))                   # total occurrences of each word
freqWords <- sort(freqWords, decreasing = TRUE)[1:10]  # top ten words
df <- data.frame(word = names(freqWords), number = freqWords)
g <- ggplot(df, aes(x = reorder(word, -number), y = number))
g <- g + geom_bar(stat = "identity")
g <- g + labs(x = "words", y = "number of occurrences")
g <- g + coord_cartesian(ylim = c(8000, 16000))        # zoom the y-axis; bars are clipped, not rescaled
g

N-grams and plot

library(RWeka)
# tokenizer that produces two-word combinations (bigrams)
gramToke <- function(x){NGramTokenizer(x, Weka_control(min = 2, max = 2))}
dtmGram <- DocumentTermMatrix(sc, control = list(tokenize = gramToke))
twoGram <- colSums(as.matrix(dtmGram))
twoGram <- sort(twoGram, decreasing = TRUE)[1:5]       # top five bigrams
twoGramDf <- data.frame(words = names(twoGram), number = twoGram)
g1 <- ggplot(twoGramDf, aes(x = reorder(words, -number), y = number))
g1 <- g1 + labs(x = "2-Gram words", y = "number of occurrences")
g1 <- g1 + geom_bar(stat = "identity")
g1 <- g1 + coord_cartesian(ylim = c(500, 2000))        # zoomed view of the y-axis
g1
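
Looking forward: wordnet synonyms

As mentioned in the conclusion, the wordnet package could map unseen words onto known synonyms. The following is an untested sketch: it assumes the wordnet package is installed along with a local WordNet dictionary the package can find (for example via the WNHOME environment variable), and “happy” is just a placeholder word.

library(wordnet)   # requires a system WordNet installation

# initDict() returns TRUE if the dictionary was found;
# use setDict("/path/to/WordNet/dict") otherwise
if (initDict()) {
        # candidate stand-ins for a word missing from the n-gram tables
        synonyms("happy", "ADJECTIVE")
}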