Overview

This is a milestone report on the progress made in the Data Science capstone course offered by Johns Hopkins University through the Coursera education platform. The capstone project applies data science to natural language processing in order to build a text prediction model.

The model will be developed in the R computing environment using the RStudio IDE. The Shiny package will be used to demonstrate the model as a web application.

# Load libraries needed
library(NLP)
library(tm)
library(RWeka)
library(compiler)
library(rpart)
library(SnowballC)
library(ggplot2)

# Initialize variables
datDir <- "../../data/raw/"
corpus <- vector("list", 3)
tdm <- vector("list", 3)
ilist <- list(c("Size (Bytes)", "Line Count", "Longest Line (Chars)", "Shortest Line (Chars)", 
    "Mean Line (Chars)", "Longest Line (Words)", "Shortest Line (Words)", "Mean Line (Words)", 
    "Size (Bytes)", "Line Count", "Longest Line (Chars)", "Shortest Line (Chars)", 
    "Mean Line (Chars)", "Longest Line (Words)", "Shortest Line (Words)", "Mean Line (Words)"), 
    c("Twitter", "Blogs", "News", "All"))
outp <- matrix(rep(1, 64), ncol = 4, byrow = TRUE, dimnames = ilist)

I also plan for the text prediction model to correct the spelling of words entered into it. I created my own spell checker based on a list of the most commonly misspelled English words, which was taken from Wikipedia. I am still trying to find similar lists for Finnish, German and Russian.

# Read common misspelled words into a table (one "wrong>right" pair per line)
misspell <- read.table(paste(datDir, "common-misspelled-words-en_us.txt", sep = ""), 
    sep = ">", quote = "\"", header = FALSE)

# Process table
for (i in 1:ncol(misspell)) {
    misspell[, i] <- tolower(misspell[, i])
    misspell[, i] <- gsub("-", "", misspell[, i])
}
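
To illustrate how a lookup table of this shape can be applied, the sketch below corrects text word by word. It is a minimal example under stated assumptions: the three-row `lookup` table and the `correct_words` helper are hypothetical stand-ins, not the Wikipedia-derived file or the spell checker used in this report.

```r
# Hypothetical lookup table: column 1 = misspelling, column 2 = correction.
# (Stand-in for the table read from the misspelled-words file.)
lookup <- data.frame(
    wrong = c("recieve", "teh", "definately"),
    right = c("receive", "the", "definitely"),
    stringsAsFactors = FALSE
)

# Correct each line word by word (hypothetical helper, base R only).
correct_words <- function(lines, table) {
    vapply(lines, function(line) {
        words <- strsplit(line, " ", fixed = TRUE)[[1]]
        idx <- match(words, table$wrong)      # NA where the word is not listed
        hit <- !is.na(idx)
        words[hit] <- table$right[idx[hit]]
        paste(words, collapse = " ")
    }, character(1), USE.NAMES = FALSE)
}

fixed <- correct_words(c("i will recieve teh package", "nothing wrong here"), lookup)
print(fixed)
```

Splitting on single spaces keeps the sketch short; real text would also need punctuation and case handled before the lookup.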

R Performance

Working with large datasets on an older system with below-average computing power has forced me to optimize the code for performance. The primary technique I used was to byte-compile as many user-defined R functions as possible. In some cases this can speed up a function by a factor of 3 or 4.

# Compile slow functions
cf_readLines <- cmpfun(function(x) {
    return(readLines(x, skipNul = TRUE))
})

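cf_spellChk <- cmpfun(function(x) {
    # Drop characters other than letters, digits, spaces and punctuation
    x <- gsub("[^a-zA-Z0-9 [:punct:]]", "", x)
    # Replace any element that exactly matches a listed misspelling
    # (the comparison is per element of x, not per word within an element)
    for (i in seq_along(x)) {
        j <- match(x[i], misspell[, 1])
        if (!is.na(j)) {
            x[i] <- misspell[j, 2]
        }
    }
    return(x)
})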

cf_tm_reduce <- cmpfun(function(x) {
    skipWords <- function(y) removeWords(y, stopwords("english"))
    tmfuns <- list(stemDocument, stripWhitespace, skipWords, removeNumbers, 
        removePunctuation, tolower, cf_spellChk)
    return(tm_reduce(x, tmfuns))
})
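
The effect of byte compilation can be checked with a small timing experiment. The sketch below is illustrative only: `sum_squares` is a hypothetical loop-heavy function, not part of this project. Note that since R 3.4.0 the JIT byte-code compiler is enabled by default, so on recent versions an explicit `cmpfun` call may show little or no difference.

```r
library(compiler)

# A deliberately loop-heavy function (hypothetical, used only for timing).
sum_squares <- function(n) {
    total <- 0
    for (k in seq_len(n)) total <- total + k * k
    total
}

# Byte-compile it explicitly, in the same way as the functions above.
cf_sum_squares <- cmpfun(sum_squares)

# Both versions must return the same value; only run time may differ.
stopifnot(identical(sum_squares(1000), cf_sum_squares(1000)))

# Timings depend on the machine and R version, so no expected values are given.
print(system.time(sum_squares(1e6)))
print(system.time(cf_sum_squares(1e6)))
```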

Dataset

The training data set was obtained from SwiftKey through a link provided by Coursera. The data set consists of news feeds, blog entries and Twitter feeds in four languages: English, German, Russian and Finnish.

Data Acquisition

The data was downloaded as a compressed archive file. Decompressing the archive revealed 12 data files. The following code was used to import the English-language data files into the R computing environment.

for (i in 1:3) {
    # Read corpus out of data files
    corpus[[i]] <- cf_readLines(paste(datDir, "SwiftKey/corpus-en_us-", ilist[[2]][i], 
        ".txt", sep = ""))
}
## Warning in readLines(x, skipNul = TRUE): incomplete final line found on
## '../../data/raw/SwiftKey/corpus-en_us-News.txt'
for (i in 1:3) {
    corpra <- corpus[[i]]
    
    # Collect statistics on raw data
    outp[1, ilist[[2]][i]] <- object.size(corpra)
    outp[2, ilist[[2]][i]] <- length(corpra)
    vch <- nchar(corpra[1:length(corpra)])
    outp[3, ilist[[2]][i]] <- max(vch)
    outp[4, ilist[[2]][i]] <- min(vch)
    outp[5, ilist[[2]][i]] <- mean(vch)
    vch <- sapply(gregexpr("[[:alpha:]]+", corpra), function(x) sum(x > 0))
    outp[6, ilist[[2]][i]] <- max(vch)
    outp[7, ilist[[2]][i]] <- min(vch)
    outp[8, ilist[[2]][i]] <- mean(vch)
    
    # Clean-up
    corpus[[i]] <- corpra
    rm(vch, corpra)
}
outp[1, 4] <- sum(outp[1, 1:3])
outp[2, 4] <- sum(outp[2, 1:3])
outp[3, 4] <- max(outp[3, 1:3])
outp[4, 4] <- min(outp[4, 1:3])
outp[5, 4] <- mean(outp[5, 1:3])
outp[6, 4] <- max(outp[6, 1:3])
outp[7, 4] <- min(outp[7, 1:3])
outp[8, 4] <- mean(outp[8, 1:3])

Data Cleaning

The raw data contained many issues that could make developing a text prediction model difficult. These include non-English characters, special characters, numbers, punctuation, misspellings, mixed upper and lower case, and unnecessary whitespace.

The tm package was used to perform transformations and mappings on the raw data to address these issues. It has built-in transformation functions for all of them except misspellings. As previously mentioned, I used my own spell checker for those.
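
To make the individual cleaning steps concrete, the sketch below reproduces a few of them with base R regular expressions. It is a simplified stand-in for illustration, not the tm transformations themselves; the `raw` lines and the `clean_lines` helper are hypothetical.

```r
# Hypothetical raw lines, for illustration only
raw <- c("  Hello, World!!  It's 2015...", "R   is GREAT :)")

clean_lines <- function(x) {
    x <- tolower(x)                      # normalise case
    x <- gsub("[[:digit:]]+", "", x)     # remove numbers
    x <- gsub("[[:punct:]]+", "", x)     # remove punctuation
    x <- gsub("[[:space:]]+", " ", x)    # collapse runs of whitespace
    trimws(x)                            # drop leading and trailing whitespace
}

cleaned <- clean_lines(raw)
print(cleaned)
```

Removing punctuation before collapsing whitespace avoids leaving double spaces where a symbol stood between two words.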

# Process data
for (i in 1:3) {
    # Apply transformations
    corpra <- cf_tm_reduce(corpus[[i]])
    
    # Collect statistics on processed data
    outp[9, ilist[[2]][i]] <- object.size(corpra)
    outp[10, ilist[[2]][i]] <- length(corpra)
    vch <- nchar(corpra[1:length(corpra)])
    outp[11, ilist[[2]][i]] <- max(vch)
    outp[12, ilist[[2]][i]] <- min(vch)
    outp[13, ilist[[2]][i]] <- mean(vch)
    vch <- sapply(gregexpr("[[:alpha:]]+", corpra), function(x) sum(x > 0))
    outp[14, ilist[[2]][i]] <- max(vch)
    outp[15, ilist[[2]][i]] <- min(vch)
    outp[16, ilist[[2]][i]] <- mean(vch)
    
    # Clean-up
    corpus[[i]] <- corpra
    rm(vch, corpra)
}
outp[9, 4] <- sum(outp[9, 1:3])
outp[10, 4] <- sum(outp[10, 1:3])
outp[11, 4] <- max(outp[11, 1:3])
outp[12, 4] <- min(outp[12, 1:3])
outp[13, 4] <- mean(outp[13, 1:3])
outp[14, 4] <- max(outp[14, 1:3])
outp[15, 4] <- min(outp[15, 1:3])
outp[16, 4] <- mean(outp[16, 1:3])

Term-Document Matrix

The plan is to convert the clean data into a Term-Document Matrix. A Term-Document Matrix shows the frequency of each term across a number of documents. In this case, a document is a single entry from a Twitter feed, blog account or news feed.
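
As a small illustration of the structure, the sketch below builds a Term-Document Matrix by hand in base R. The three documents are hypothetical, and the code is only meant to show the shape of the result, not how the tm package computes it.

```r
# Three hypothetical, already-cleaned documents
docs <- c(doc1 = "the cat sat", doc2 = "the dog sat", doc3 = "the cat ran")

# Split each document into terms (lapply keeps the document names)
tokens <- lapply(docs, function(d) strsplit(d, " ", fixed = TRUE)[[1]])
terms <- sort(unique(unlist(tokens)))

# Rows are terms, columns are documents, cells are term counts
tdm_toy <- vapply(tokens, function(tk) {
    as.integer(table(factor(tk, levels = terms)))
}, integer(length(terms)))
rownames(tdm_toy) <- terms

print(tdm_toy)
print(rowSums(tdm_toy))   # overall frequency of each term
```

Row sums give the corpus-wide frequency of each term, which is the quantity most exploratory summaries start from.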

cf_GramTokenizer <- cmpfun(function(x) {
    NGramTokenizer(x, Weka_control(min = 2, max = 3))
})
cf_VecSource <- cmpfun(function(x) {
    SimpleSource(length = length(x), content = as.character(x), class = "VecSource")
})
getElem.VecSource <- cmpfun(function(x) {
    list(content = x$content[x$position], uri = NULL)
})
pGetElem.VecSource <- cmpfun(function(x) {
    lapply(x$content, function(y) list(content = y, uri = NULL))
})

set.seed(1989)
for (i in 1:3) {
    samp <- sample(corpus[[i]], 500)
    vcorpus <- VCorpus(cf_VecSource(samp), list(reader = readPlain))
    vcorpus <- tm_map(vcorpus, removeWords, stopwords("en"))
    tdm[[i]] <- TermDocumentMatrix(vcorpus, control = list(tokenize = cf_GramTokenizer))
}
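
The tokenizer above is configured with min = 2 and max = 3, so each document is broken into bigrams and trigrams. The base R sketch below shows what that means for a single line. The `make_ngrams` helper is hypothetical and written for illustration; the order in which RWeka returns its n-grams may differ.

```r
# Build all n-grams of sizes min_n..max_n from one line of text
# (hypothetical helper, base R only).
make_ngrams <- function(line, min_n = 2, max_n = 3) {
    words <- strsplit(trimws(line), "[[:space:]]+")[[1]]
    out <- character(0)
    for (n in min_n:max_n) {
        if (length(words) < n) next       # line too short for this n
        starts <- seq_len(length(words) - n + 1)
        out <- c(out, vapply(starts, function(s) {
            paste(words[s:(s + n - 1)], collapse = " ")
        }, character(1)))
    }
    out
}

grams <- make_ngrams("the quick brown fox")
print(grams)
```

A four-word line yields three bigrams and two trigrams, while a one-word line yields none.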

Exploratory Analysis

For the exploratory analysis I will generate summaries of the three files read into the R environment. The summaries will include word counts, line counts and basic data tables.

Graphics Setup

The analysis requires some setup in the R environment. The following code was taken from the R Graphics Cookbook [1]. It allows multiple ggplot2 plots to be drawn on one page.

# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE), then
# plot 1 will go in the upper left, 2 will go in the upper right, and 3 will
# go all the way across the bottom.
multiplot <- function(..., plotlist = NULL, file, cols = 1, layout = NULL) {
    library(grid)
    # Make a list from the ... arguments and plotlist
    plots <- c(list(...), plotlist)
    numPlots = length(plots)
    # If layout is NULL, then use 'cols' to determine layout
    if (is.null(layout)) {
        # Make the panel
        # ncol: Number of columns of plots
        # nrow: Number of rows needed, calculated from # of cols
        layout <- matrix(seq(1, cols * ceiling(numPlots/cols)), ncol = cols, 
            nrow = ceiling(numPlots/cols))
    }
    if (numPlots == 1) {
        print(plots[[1]])
    } else {
        # Set up the page
        grid.newpage()
        pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
        # Make each plot, in the correct location
        for (i in 1:numPlots) {
            # Get the i,j matrix positions of the regions that contain this subplot
            matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
            print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row, layout.pos.col = matchidx$col))
        }
    }
}

# Data frame to hold the statistical info
df <- data.frame(
    Type = factor(c("Twitter", "Blogs", "News", "All"),
        levels = c("Twitter", "Blogs", "News", "All")),
    Total_Bytes = outp[1, ], Total_Lines = outp[2, ],
    Max_Chars_per_Line = outp[3, ], Min_Chars_per_Line = outp[4, ],
    Avg_Chars_per_Line = outp[5, ], Max_Words_per_Line = outp[6, ],
    Min_Words_per_Line = outp[7, ], Avg_Words_per_Line = outp[8, ],
    Total_Bytes2 = outp[9, ], Total_Lines2 = outp[10, ],
    Max_Chars_per_Line2 = outp[11, ], Min_Chars_per_Line2 = outp[12, ],
    Avg_Chars_per_Line2 = outp[13, ], Max_Words_per_Line2 = outp[14, ],
    Min_Words_per_Line2 = outp[15, ], Avg_Words_per_Line2 = outp[16, ])

Summary Statistics

This section displays the results of the exploratory analysis. In the analysis, a data source refers to the corpus of documents or entries coming from a Twitter feed, a blog account or a news feed. The raw data is analyzed first, followed by the processed data.

Raw Data

The raw data is the unprocessed data read from the downloaded data files.

knitr::kable(outp[1:8,], caption="Table 1: Raw Data Statistics", digits=0);
Table 1: Raw Data Statistics

                         Twitter      Blogs      News        All
Size (Bytes)           316037600  260564320  20111392  596713312
Line Count               2360148     899288     77259    3336695
Longest Line (Chars)         213      40835      5760      40835
Shortest Line (Chars)          2          1         2          1
Mean Line (Chars)             69        232       203        168
Longest Line (Words)          62       6454       640       6454
Shortest Line (Words)          1          0         0          0
Mean Line (Words)             13         42        35         30
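
Note that the "All" value for the mean rows is the simple average of the three per-source means, so the small News file counts as much as the much larger Twitter file. The sketch below compares that figure with a mean weighted by line count. It uses the rounded values printed in Table 1, so the weighted result is approximate.

```r
# Values copied from Table 1 (rounded as printed)
line_count <- c(Twitter = 2360148, Blogs = 899288, News = 77259)
mean_chars <- c(Twitter = 69, Blogs = 232, News = 203)

# Unweighted mean of the three per-source means (the "All" column)
unweighted <- mean(mean_chars)

# Mean weighted by the number of lines in each source
weighted <- weighted.mean(mean_chars, w = line_count)

print(round(c(unweighted = unweighted, weighted = weighted)))
```

The unweighted figure is 168 characters per line, while the line-weighted figure is about 116, because Twitter contributes most of the lines and has the shortest entries.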

Some of the statistics on the raw data are shown in the figures generated by the following R code.

p1 <- ggplot(data=df, aes(x=Type, y=Total_Bytes, fill=Type)) + 
        geom_bar(colour="black", stat="identity") + 
        xlab("Source") + ylab("Size (Bytes)") +
        ggtitle("Total Size of Data Source");
p2 <- ggplot(data=df, aes(x=Type, y=Total_Lines, fill=Type)) + 
        geom_bar(colour="black", stat="identity") + 
        xlab("Source") + ylab("Entry") +
        ggtitle("Total Number of Entries in Data Source");
p3<- ggplot(data=df, aes(x=Type, y=Max_Chars_per_Line, fill=Type)) + 
        geom_bar(colour="black", stat="identity") + 
        xlab("Source") + ylab("Characters") +
        ggtitle("Maximum Characters per Entry");
p4 <- ggplot(data=df, aes(x=Type, y=Min_Chars_per_Line, fill=Type)) + 
        geom_bar(colour="black", stat="identity") + 
        xlab("Source") + ylab("Characters") +
        ggtitle("Minimum Characters per Entry");
p5<- ggplot(data=df, aes(x=Type, y=Avg_Chars_per_Line, fill=Type)) + 
        geom_bar(colour="black", stat="identity") + 
        xlab("Source") + ylab("Characters") +
        ggtitle("Average Characters per Entry");
p6<- ggplot(data=df, aes(x=Type, y=Max_Words_per_Line, fill=Type)) + 
        geom_bar(colour="black", stat="identity") + 
        xlab("Source") + ylab("Words") +
        ggtitle("Maximum Words per Entry");
p7 <- ggplot(data=df, aes(x=Type, y=Min_Words_per_Line, fill=Type)) + 
        geom_bar(colour="black", stat="identity") + 
        xlab("Source") + ylab("Words") +
        ggtitle("Minimum Words per Entry");
p8<- ggplot(data=df, aes(x=Type, y=Avg_Words_per_Line, fill=Type)) + 
        geom_bar(colour="black", stat="identity") + 
        xlab("Source") + ylab("Words") +
        ggtitle("Average Words per Entry");

multiplot(p1, p2, p3, p4, p5, p6, p7, p8, cols=2);

Figure: Unprocessed Data

Processed Data

The processed data is the data that has had transformations applied to it, making it more usable for model building. The following transformations were applied:

  • Strip out Whitespace
  • Change to Lowercase
  • Remove Numbers
  • Remove Punctuation
  • Replace Misspelled Words
  • Remove English Stop Words
  • Stem Words
knitr::kable(outp[9:16,], caption="Table 2: Processed Data Statistics", digits=0);
Table 2: Processed Data Statistics

                         Twitter      Blogs      News        All
Size (Bytes)           254000872  183449160  15444944  452894976
Line Count               2360148     899288     77259    3336695
Longest Line (Chars)         140      29702      3256      29702
Shortest Line (Chars)          0          0         0          0
Mean Line (Chars)             46        147       139        111
Longest Line (Words)          47       3916       533       3916
Shortest Line (Words)          0          0         0          0
Mean Line (Words)              7         22        20         16

Some of the statistics on the processed data are shown in the figures below.

p1 <- ggplot(data=df, aes(x=Type, y=Total_Bytes2, fill=Type)) + 
        geom_bar(colour="black", stat="identity") + 
        xlab("Source") + ylab("Size (Bytes)") +
        ggtitle("Total Size of Data Source");
p2 <- ggplot(data=df, aes(x=Type, y=Total_Lines2, fill=Type)) + 
        geom_bar(colour="black", stat="identity") + 
        xlab("Source") + ylab("Entry") +
        ggtitle("Total Number of Entries in Data Source");
p3<- ggplot(data=df, aes(x=Type, y=Max_Chars_per_Line2, fill=Type)) + 
        geom_bar(colour="black", stat="identity") + 
        xlab("Source") + ylab("Characters") +
        ggtitle("Maximum Characters per Entry");
p4 <- ggplot(data=df, aes(x=Type, y=Min_Chars_per_Line2, fill=Type)) + 
        geom_bar(colour="black", stat="identity") + 
        xlab("Source") + ylab("Characters") +
        ggtitle("Minimum Characters per Entry");
p5<- ggplot(data=df, aes(x=Type, y=Avg_Chars_per_Line2, fill=Type)) + 
        geom_bar(colour="black", stat="identity") + 
        xlab("Source") + ylab("Characters") +
        ggtitle("Average Characters per Entry");
p6<- ggplot(data=df, aes(x=Type, y=Max_Words_per_Line2, fill=Type)) + 
        geom_bar(colour="black", stat="identity") + 
        xlab("Source") + ylab("Words") +
        ggtitle("Maximum Words per Entry");
p7 <- ggplot(data=df, aes(x=Type, y=Min_Words_per_Line2, fill=Type)) + 
        geom_bar(colour="black", stat="identity") + 
        xlab("Source") + ylab("Words") +
        ggtitle("Minimum Words per Entry");
p8<- ggplot(data=df, aes(x=Type, y=Avg_Words_per_Line2, fill=Type)) + 
        geom_bar(colour="black", stat="identity") + 
        xlab("Source") + ylab("Words") +
        ggtitle("Average Words per Entry");

multiplot(p1, p2, p3, p4, p5, p6, p7, p8, cols=2);

Figure: Processed Data

Observations

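Processing the data reduced the overall size of the corpora from 596,713,312 bytes to 452,894,976 bytes, a reduction of about 24%, while the line count stayed the same. The mean line length in the All column fell from 30 to 16 words.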

Next Step

The next step will be to convert each of the three corpora into a Term-Document Matrix. In a Term-Document Matrix, the terms are the rows and the documents are the columns. At that point, further transformations would be applied.

The Term-Document Matrix will be the foundation for understanding word frequencies, predictive weighting and building the overall text prediction model.

Plans for a Prediction Algorithm

From the Term-Document Matrices, I will use the “caret” package to train a prediction model.
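
The caret-based model is not yet specified in this report. As a simpler point of comparison, and not the approach planned above, a plain frequency lookup over bigrams can already suggest a next word. The sketch below assumes a tiny hypothetical training set; `pairs` and `predict_next` are illustrative names only.

```r
# Hypothetical cleaned training lines
train <- c("i love data science", "i love r", "i like data", "data science is fun")

# Collect (previous word, next word) pairs from every line
pairs <- do.call(rbind, lapply(strsplit(train, " ", fixed = TRUE), function(w) {
    if (length(w) < 2) return(NULL)
    data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
}))

# Suggest the most frequent follower of a word; NA if the word was never seen.
# Ties are resolved in favour of the alphabetically first follower.
predict_next <- function(word, pairs) {
    followers <- pairs$nxt[pairs$prev == word]
    if (length(followers) == 0) return(NA_character_)
    counts <- table(followers)
    names(counts)[which.max(counts)]
}

print(predict_next("i", pairs))
print(predict_next("data", pairs))
```

In this toy corpus "i" is followed by "love" twice and "like" once, so "love" is suggested. A word that never appears as a predecessor returns NA, which is where a fallback strategy would be needed.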

References

  1. Chang, Winston; “R Graphics Cookbook”; O’Reilly Media, Inc; 2013; Pages 49-69.
  2. Xie, Yihui; “Dynamic Documents with R and knitr”; Taylor & Francis Group, LLC; 2013