Introduction

Our goal is to produce a functioning data product to predict the next likely word given an initial phrase. We began this task on March 7th; this report outlines progress to date and the planned steps for successful completion by April 24th. We were supplied with three large text files from three internet sources (tweets, blogs and news) and tasked with completing the following for this report:

The results of these tasks are shown below with appropriate discussion. The final section lays out the remaining major steps to finish the project. Two of those steps are to better understand memory use and data size limitations and to select which of several R packages to use in building and modeling our data. An appendix with references and code completes the report.

Extract, Transform, Load (ETL)

Basic metrics about the raw files are shown below. We use system functions to produce the raw numbers and R functions to build the tables. For now we ignore issues such as “what is a word?”

In order to work with the data more efficiently we built samples, selecting from each file proportionally to its contribution to the total line count and limiting the overall sample to 10% of the data. The second table shows the sample metrics.

The twitter sample is still unwieldy, so our file read function for this report currently reads only the first 100,000 × (% of total lines) lines from each file.

Table 1. Metrics for Assigned Data

Source              Num Lines   % of Tot Lines   Avg Words/Line   Longest Line (chars)
en_US.blogs.txt       899,288          21.1 %             41.4                 40,832
en_US.news.txt      1,010,242          23.7 %             34.0                 11,384
en_US.twitter.txt   2,360,148          55.3 %             12.9                    140
total               4,269,678         100.0 %             23.9                 40,832

Total words can be computed from lines and words per line; for the full data this works out to roughly 4,269,678 × 23.9 ≈ 102 million words.

Table 2. Metrics for File Samples

Source              Num Lines   % of Tot Lines   Avg Words/Line   Longest Line (chars)
blog.train.txt         89,929          21.1 %             41.4                 37,166
news.train.txt        101,024          23.7 %             33.9                  2,625
twit.train.txt        236,015          55.3 %             12.8                    140
total                 426,968         100.0 %             23.9                 37,166

It seems logical that twitter contributes the most lines and that news and blogs have similar line counts. We also aren't surprised to see that blogs have a higher average number of words per line. Line length is something to explore in a more advanced project; still, it is good to see that the longest twitter line is 140 characters.

Currently we are working with each file independently, trying to understand how the different sources look in relation to each other. A language prediction machine needs to be able to handle the unexpected, so we will eventually need to combine all of the data. But it is also fun to imagine being able to predict differently for formal versus informal writing: news might be better for formal language, and tweets could be a gold mine of slang.

Corpus Creation and Cleaning

To work with text in a meaningful way we need to restructure it. Listing and counting occurrences of words is a logical and “easy” first step (maybe not fun without a computer). Hidden code loads the libraries and data samples we will work with for the rest of the report.

Before using some of the more formal tools we go through the text and remove foreign characters, keeping only English alphanumeric characters (a-z, A-Z and 0-9) and standard punctuation ( ’,-:;.!? ).
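A minimal sketch of that filter (the full version is the charfixer function in the appendix; the hyphen sits last in the character class so it is read literally rather than as a range):

txt <- "Smörgåsbord — isn't that nice?"
# replace anything outside the allowed set with a single space
gsub("[^a-zA-Z0-9',:;.!?-]", " ", txt)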

Tokenization and Filtering

We are using the R package tm at this time. A simple Corpus() command turns our text data into this special structure, and from there we use tm_map and its many built-in operations. After some experimentation, cleaning is done as follows (a short code sketch appears after the list):

  1. Remove stopwords like “be”, “on” or “he”.
  2. Transform to lower case.
  3. Remove punctuation and then remove numbers.
  4. Strip white space.
  5. Remove profanity using a list we supply.
  6. Stem words.
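A minimal sketch of the corpus build and cleaning, assuming twit.train holds the sampled tweet lines and badwords is the profanity list loaded in the appendix code (the same steps are wrapped up in the corpCleaner function there):

library(tm)
library(SnowballC)   # stemDocument() relies on SnowballC

twit.corp <- Corpus(VectorSource(twit.train))            # character vector -> corpus
twit.corp <- tm_map(twit.corp, removeWords, stopwords("english"))
twit.corp <- tm_map(twit.corp, content_transformer(tolower))
twit.corp <- tm_map(twit.corp, removePunctuation)
twit.corp <- tm_map(twit.corp, removeNumbers)
twit.corp <- tm_map(twit.corp, stripWhitespace)
twit.corp <- tm_map(twit.corp, removeWords, badwords)    # profanity list
twit.corp <- tm_map(twit.corp, stemDocument)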

Below we print example lines from the source files. First is the line as we imported it and second is the corpus entry for that line.


List 1: Raw Text Compared to Corpus

Twitter
## Is his eardrum, on his fifty he be like  What? I can't hear you.  lol
## list(content = "is eardrum fifti like what i hear lol", meta = list(author = character(0), datetimestamp = list(sec = 13.0866029262543, min = 25, hour = 23, mday = 19, mon = 2, year = 116, wday = 6, yday = 78, isdst = 0), description = character(0), heading = character(0), id = "4", language = "en", origin = character(0)))

After the first two hashes is the raw tweet: Is his eardrum, on his fifty he be like What? I can’t hear you. lol

After the second set of hashes, after “content =” is the cleaned tweet: is eardrum fifti like what i hear lol

As can be seen we have removed the punctuation, changed the case, removed words such as “on” and “be” and changed “fifty” to the stemmed “fifti.”

News
## The National Immigration Law Center, an immigrant advocacy group, says mandating its use nationwide would force millions of people to correct false information in the system or lose their jobs. It would put nearly 800,000 people out of work because of database errors. What's more, E-Verify cannot detect identity fraud.
## list(content = "the nation immigr law center immigr advocaci group say mandat use nationwid forc million peopl correct fals inform system lose job it put near peopl work databas error what everifi detect ident fraud", meta = list(author = character(0), datetimestamp = list(sec = 19.4614140987396, min = 25, hour = 23, mday = 19, mon = 2, year = 116, wday = 6, yday = 78, isdst = 0), description = character(0), heading = character(0), id = "100", language = "en", origin = character(0)))

This is an example from the news file. Note that the number 800,000 does not occur in the cleaned version.


Word Frequencies

Visualizing our information can help us understand it better, especially something as unwieldy as a chunk of text. From the corpus we create the term-document matrix, which is what it sounds like: terms are rows, document ids are columns and the entries are counts of those terms appearing in the documents.

Such objects are also likely to be highly sparse, so we take some steps to reduce the matrix to something more usable.
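A minimal sketch, assuming twit.corp is the cleaned corpus from above; the 0.999 sparsity threshold is illustrative rather than the exact value we used:

tdm <- TermDocumentMatrix(twit.corp)

# drop terms that are absent from nearly all documents to shrink the matrix
tdm.small <- removeSparseTerms(tdm, 0.999)

# overall term frequencies, plus the words appearing at least 200 times
freq <- sort(rowSums(as.matrix(tdm.small)), decreasing = TRUE)
findFreqTerms(tdm.small, lowfreq = 200)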


Graphic 1. Up to 20 Words Occurring at Least 200 Times

[Word clouds, left to right: Twitter | Blog | News (excluding “said”)]

The word cloud gives a quick visual of the relative frequency of words, typically through size and color. In trying to reduce the upper and lower margins of the chart we found discussion of the shortcomings of word clouds (see for example the references). Exploratory analysis is about trying to look at our data, and visualizations are a great way to do that.
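For reference, one way to draw such a cloud with the wordcloud package, assuming freq is the frequency vector computed above (the exact arguments we used may differ):

library(wordcloud)
library(RColorBrewer)

# up to 20 words occurring at least 200 times, most frequent drawn largest
wordcloud(names(freq), freq, min.freq = 200, max.words = 20,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))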

Here the goal was just to get a sense of the common words in each source and literally see which stand out as being the same or different. “Will”, “just”, “time”, “peopl” and “like” appear in all three. From twitter we note that in this sample a positive sentiment stands out: “love”, “thank” and “good.” In some of our exploration this wasn’t the case. Many words appear frequently in two of the three sources, e.g. “year” and “work”. Finally, it was notable that “citi” appears near the top for the news source, and the words “play” and “game” were unexpected until we remembered the sports news.

Frankly, a table of words by source might convey things better; for one, we would have the counts (recall that each source does not contribute equally to the sample). The link given above shows an interesting modification to the word cloud.

It should be noted that we don’t know whether the frequent occurrence of “like” is due to the existence of the “like” button, though it seems likely. In addition, we need to check the news for author, date and other tagline information that should be accounted for.

N-Grams

Above we created unigrams, collections of single words. Now we create bigrams, collections of word pairs. For the next step in our project we will need to create trigrams and then use the n-grams we have created to build our initial prediction algorithm.
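One way to build the bigrams is with the RWeka tokenizer we installed; a sketch, assuming twit.corp is the cleaned corpus (the exact control settings we used may differ):

library(RWeka)

# tokenizer that emits word pairs instead of single words
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

tdm.bi <- TermDocumentMatrix(twit.corp, control = list(tokenize = BigramTokenizer))
bi.freq <- sort(rowSums(as.matrix(tdm.bi)), decreasing = TRUE)
head(bi.freq, 20)   # the kind of counts behind Graphic 2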

The example words given highlight issues when working with language (versus raw text). Is “will” a legal document or “I will get the milk”? Does the news tell us about “you have to play the game to get ahead” or “that was the play of the game!”

Graphic 2. Top 20 Bi-Grams for Each Source

[Bigram frequency plots, left to right: Twitter | Blog | News]

Although we took some steps to clean out “i ve” and most of the “said i” and “i said” type pairs, we still have a number of terms that seem redundant or potentially of little value. In hindsight, leaving in certain punctuation marks, or doing more word transformations so that “I’ve” becomes “I have”, might be in order.

The occurrence of the pronoun “I” isn’t to be wondered at in tweets and blog posts. In our news corpus we are still plagued with words that most likely come from sources quoted in the story. Really, we just wanted some interesting word pairs to look at, as shown by printing the last 10 pairs from each source below.

##      will see     will take      year old great weekend        haha i 
##            23            23            23            22            22 
##         i bet    just watch     much love         whi i    will never 
##            22            22            22            22            22
##      s like  say someth    the fact   the stori   this will    wait see 
##          24          24          24          24          24          24 
##   work hard     world i yesterday i     you see 
##          24          24          24          24
## obama administr     pretti good        the next         two day 
##              57              57              57              57 
##   unemploy rate       want make      offic said         said im 
##              57              57              56              56 
##      said today        tri make 
##              56              56

Discussion

Working with a body of text is different from working with the usual row-and-column data. Fortunately, for our task we can turn this data into structures we can understand: n-grams and document-term or term-document matrices. The next step is to combine the data from our sources and build a word prediction model.

Some of our results so far leave a little to be desired, but most of them were obtained in the context of putting together this report. Certainly we have a better understanding of what we have to work with and of the steps needed to clean and “tokenize” the data into a form we can use to build a model that appears to understand phrases.

Next Steps

In order to meet the standards and due date for the project we plan to:

  1. Review the quanteda package we have come across in our research and in class discussion.
  2. Review the tools for understanding memory use and processing.
  3. Create a rudimentary predictive model.
    A. Logically, we can think of using our 1-, 2- and 3-gram tables as look-up tables for filling in the next word in a phrase, using word frequencies to create probabilities for what the next word would be. Given “happy new” we want to pick “year” and not “york.” A rough sketch of this look-up idea appears after this list.
    B. We need our method to attempt to handle unknown word combinations. If we don’t have the bigram “natural language” would our machine offer up “processing” as the next word?
  4. We need to get a model into the Shiny app as soon as possible to test for memory size and technical issues.
  5. QC the entire thing.
  6. Create the pitch and deploy the app!
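A minimal sketch of the look-up idea in item 3A, using hypothetical data frames bi.tab and tri.tab (columns w1, w2, w3 and count) built from the n-gram frequencies above; the real model will need smoothing and a fuller backoff scheme:

predictNext <- function(w1, w2, bi.tab, tri.tab) {
    # try the trigram table first: most frequent w3 following "w1 w2"
    hits <- tri.tab[tri.tab$w1 == w1 & tri.tab$w2 == w2, ]
    if (nrow(hits) > 0) return(hits$w3[which.max(hits$count)])

    # back off to the bigram table: most frequent word following w2 alone
    hits <- bi.tab[bi.tab$w1 == w2, ]
    if (nrow(hits) > 0) return(hits$w2[which.max(hits$count)])

    NA_character_   # unseen combination; to be handled by smoothing later
}

# given "happy new" we want "year", not "york":
# predictNext("happy", "new", bi.tab, tri.tab)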

We hope to get through the first three items above before the end of week three, following the syllabus. Getting something in place and refining it is better than trying to create something perfect.

Appendix

Glossary

  • Corpus: collection of documents
  • n-grams: collections of word phrases; uni = single words, bi = word pairs, tri = triplets, etc.
  • stemming: reducing words to their root form, e.g. “stemming” becomes “stem”
  • stopwords: the most common words in a language
  • term document matrix: grid of terms by documents with term frequencies as the body. Alternatively, a document term matrix can be constructed with documents as the rows

References

From the Course notes or lectures directly:
Additional References (discovered via google search unless otherwise noted):

R code

Functions - multifunc.R
#####################################################
## Data Science Capstone: Milestone Report
## milefuncts.R
## Jesse Sharp
## 2016-03-19
## functions used in CDSCmilestone.Rmd
#####################################################

# Read Samples
loadsamp <- function(x, pt) {
    con1 <- file(x, "rb")
    y <- readLines(con1, 100000 * pt, skipNul = TRUE)  # read 100,000 * pt lines
    close(con1)  # close the connection
    return(y)
}

# clean text
charfixer <- function(x) gsub("[^a-zA-Z0-9',:;.!?-]", " ", x)  # hyphen last so it is literal, not a range

wordfixer <- function(x) {
    x <- gsub("He'll", "he will", x)
    x <- gsub("he'll", "he will", x)
    x <- gsub("i'll", "i will", x)
    x <- gsub("I'll", "i will", x)
    x <- gsub("I've", "i have", x)
    x <- gsub("i've", "i have", x)
    x <- gsub("it's", "its", x)  # it is or it's?
    return(x)
}

## bad words
conb <- file("C:/StatWare/Rprog/Capstone/data/badwords.txt", "r")
badwords <- readLines(conb, skipNul=TRUE)
close(conb)

## corpus cleaning function
corpCleaner <- function(x){
    x <- tm_map(x, removeWords, stopwords("english"))
    x <- tm_map(x, content_transformer(tolower))
    x <- tm_map(x, removePunctuation)
    x <- tm_map(x, removeNumbers)
    x <- tm_map(x, stripWhitespace)
    x <- tm_map(x, removeWords, badwords)
    # stemming (optional)
    x <- tm_map(x, stemDocument)
    return(x)
}


# Printing samples
prntdoc <- function(x,y,z) {
    # print the raw line z alongside its cleaned corpus entry
    RAWTXT <- x[z]
    corp1 <- as.character(y$content[z])
    #  return(print(rbind(RAWTXT, corp1)))
    return(writeLines(rbind(RAWTXT, corp1)))
}

# Multi-plot function
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
    library(grid)
    # Make a list from the ... arguments and plotlist
    plots <- c(list(...), plotlist)
    numPlots = length(plots)

    # If layout is NULL, then use 'cols' to determine layout
    if (is.null(layout)) {
        # Make the panel
        # ncol: Number of columns of plots
        # nrow: Number of rows needed, calculated from # of cols
        layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                         ncol = cols, nrow = ceiling(numPlots/cols))
    }

    if (numPlots==1) {
        print(plots[[1]])

    } else {
        # Set up the page
        grid.newpage()
        pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
        # Make each plot, in the correct location
        for (i in 1:numPlots) {
            # Get the i,j matrix positions of the regions that contain this subplot
            matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

            print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                            layout.pos.col = matchidx$col))
        }
    }
}


# Cleveland plot credit: https://rpubs.com/escott8908/RGC_Ch3_Gar_Graphs
plotTop <- function(z) {
    x <- ggplot(z, aes(x = log(counts), y = reorder(bigram, counts))) +
        geom_point(size = 3) +
        theme_bw() +
        theme(panel.grid.major.x = element_blank(),
              panel.grid.minor.x = element_blank(),
              panel.grid.major.y = element_line(color = 'grey60', linetype = 'dashed'))
    return(x)
}
Data Reading and Sampling - appendix.R
#####################################################
## Data Science Capstone: Milestone Report
## appendix.R
## Jesse Sharp
## 2016-03-19
## data reading/sampling used in CDSCmilestone.Rmd
#####################################################

# Appendix R Code
# packages we have installed so far
install.packages("tm")
install.packages("worldcloud")
install.packages("SnowballC")
install.packages("rJava")
install.packages("RWeka")

# Task: confirm ability to read the data
datURL <- ("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-Swiftkey.zip")
datFile <- ("../Coursera-Swiftkey.zip")
getwd()
setwd("C:/StatWare/Rprog/Capstone")

#datFile1 <- ("./data/filename")

if (!file.exists("./data")) {
    dir.create("./data")
}
if (!file.exists(datFile)) {
    download.file(datURL, destfile=datFile)
    unzip(datFile)
}

# Read in all the data - use "rb" - binary is a good default even for text
# (instead of being more complete and writing a function to sample the data as it is
# read for better resource use)

con1 <- file("C:/Statware/Rprog/Capstone/data/en_US.twitter.txt", "rb")
twit.dat <- readLines(con1, skipNul=TRUE)
close(con1) ## close the connection
length(twit.dat)

con2 <- file("C:/Statware/Rprog/Capstone/data/en_US.blogs.txt", "rb")
blog.dat <- readLines(con2, skipNul=TRUE)
close(con2) ## close the connection
length(blog.dat)

con3 <- file("C:/Statware/Rprog/Capstone/data/en_US.news.txt", "rb")
news.dat <- readLines(con3, skipNul=TRUE)
close(con3) ## close the connection
length(news.dat)

# Create training sample and write to file
# combine then sample or sample then combine? discuss!
# we are taking a 10% sample and sampling each set proportionally to its
# contribution by line count to the total body of text
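# NOTE: line.count and line.prop (the per-file line counts and proportions behind
# Table 1) are assumed to have been computed in the earlier metrics step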

#line.prop <- .2 # TESTING
set.seed(3456)
line.tot <- as.numeric(gsub('[^0-9]', '', line.count[4])) # denominator
twit.train <- sample(twit.dat,size=round((0.1)*line.prop[3]*line.tot))
blog.train <- sample(blog.dat,size=round((0.1)*line.prop[1]*line.tot))
news.train <- sample(news.dat,size=round((0.1)*line.prop[2]*line.tot))

## Save samples
writeLines(twit.train, "./data/twit.train.txt")
writeLines(blog.train, "./data/blog.train.txt")
writeLines(news.train, "./data/news.train.txt")

# Clean up
1 - (memory.size()/memory.limit()) # avail memory as %
rm(list=c("twit.dat","blog.dat","news.dat"))
gc()
1 - (memory.size()/memory.limit()) # avail memory as %
Session Information
print(sessionInfo(), locale=FALSE)
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.12.3
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    formatR_1.3     tools_3.2.3     htmltools_0.3  
##  [5] yaml_2.1.13     stringi_1.0-1   rmarkdown_0.9.5 highr_0.5.1    
##  [9] stringr_1.0.0   digest_0.6.9    evaluate_0.8.3