Our goal is to produce a functioning data product that predicts the next likely word given an initial phrase. We began this task on March 7th; this report outlines progress to date and the planned steps for successful completion by April 24th. We were supplied with three large text files from three internet sources: tweets, blogs, and news, and were tasked with a set of exploratory tasks for this report.
The results of these tasks are shown below with appropriate discussion. The final section lays out the remaining major steps to finish the project; two of those steps are to better understand memory use and data-size limitations and to select which of several R packages to use in building and modeling our data. An appendix with references and code completes the report.
Basic metrics about the raw files are shown below. We use system functions to produce the raw numbers and R functions to build the tables. For now we ignore issues such as “what is a word?”
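A pure-R equivalent of those raw metrics might look like the sketch below; the function name is illustrative and the whitespace split is a deliberately naive definition of a "word."

```r
# Illustrative raw-file metrics: line count, average words per line
# (naive whitespace-split "words"), and longest line in characters.
rawMetrics <- function(path) {
  lines <- readLines(path, skipNul = TRUE)
  words <- sum(lengths(strsplit(lines, "\\s+")))
  c(numLines     = length(lines),
    wordsPerLine = round(words / length(lines), 1),
    longestLine  = max(nchar(lines)))
}
# rawMetrics("./data/en_US.twitter.txt")
```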
In order to work with the data more efficiently we built samples, selecting from each file in proportion to its contribution to the total line count and limiting the overall sample to 10%. The second table shows the sample metrics.
The twitter sample is still unwieldy, so our file-read function for this report currently reads only the first 100,000 × (share of total lines) lines from each file.
| Source | Num Lines | % of Tot Lines | Avg Words per Line | Longest Line (chars) |
|---|---|---|---|---|
| en_US.blogs.txt | 899,288 | 21.1 % | 41.4 | 40,832 |
| en_US.news.txt | 1,010,242 | 23.7 % | 34.0 | 11,384 |
| en_US.twitter.txt | 2,360,148 | 55.3 % | 12.9 | 140 |
| total | 4,269,678 | 100.0 % | 23.9 | 40,832 |
Total words can be computed from the number of lines and the average words per line.
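As a rough check, blogs work out to about 899,288 × 41.4 ≈ 37.2 million words:

```r
# Approximate total words per source = number of lines * average words per line
round(899288  * 41.4 / 1e6, 1)  # blogs:   ~37.2 million words
round(1010242 * 34.0 / 1e6, 1)  # news:    ~34.3 million words
round(2360148 * 12.9 / 1e6, 1)  # twitter: ~30.4 million words
```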
| Source | Num Lines | % of Tot Lines | Avg Words per Line | Longest Line (chars) |
|---|---|---|---|---|
| blog.train.txt | 89,929 | 21.1 % | 41.4 | 37,166 |
| news.train.txt | 101,024 | 23.7 % | 33.9 | 2,625 |
| twit.train.txt | 236,015 | 55.3 % | 12.8 | 140 |
| total | 426,968 | 100.0 % | 23.9 | 37,166 |
It seems logical that twitter contributes the most lines and that news and blogs have similar line counts. We also aren't surprised that blogs have the highest average words per line. Line length is something to explore in a more advanced project; for now it is reassuring that the longest twitter line is exactly 140 characters.
Currently we are working with each file independently, trying to understand how the different sources compare. A language prediction machine needs to handle the unexpected, so we will eventually combine all of the data. Still, it is interesting to imagine predicting separately for formal and informal writing: news might be better for formal language, while tweets could be a gold mine of slang.
To work with text in a meaningful way we need to restructure it. Listing and counting occurrences of words is a logical and "easy" first step (maybe not fun without a computer). Hidden code loads the libraries and data samples we will work with for the rest of the report.
Before using the more formal tools we go through the text and remove foreign characters, keeping only English alphanumerics (a-z, A-Z and 0-9) and standard punctuation ( ',-:;.!? ).
We are using the R package tm at this time. A simple Corpus() command turns our text data into this special structure, and tm_map then gives us access to many built-in operations. After some experimentation, cleaning is done as sketched below (the full corpCleaner() function appears in the appendix):
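Here is a minimal sketch of that step, assuming twit.train holds the twitter sample as a character vector; corpCleaner() is the cleaning function listed in the appendix.

```r
library(tm)
library(SnowballC)  # stemming backend used by stemDocument

# build a corpus from the sample lines, then run the tm_map cleaning pipeline
twit.corp <- Corpus(VectorSource(twit.train))
twit.corp <- corpCleaner(twit.corp)  # stop words, lowercase, punctuation,
                                     # numbers, whitespace, profanity, stemming
```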
Below we print two lines from the twitter file and one line each from the blog and news files. First is the line as we imported it and second is the corpus entry for that line.
## Is his eardrum, on his fifty he be like What? I can't hear you. lol
## list(content = "is eardrum fifti like what i hear lol", meta = list(author = character(0), datetimestamp = list(sec = 13.0866029262543, min = 25, hour = 23, mday = 19, mon = 2, year = 116, wday = 6, yday = 78, isdst = 0), description = character(0), heading = character(0), id = "4", language = "en", origin = character(0)))
After the first two hashes is the raw tweet: Is his eardrum, on his fifty he be like What? I can’t hear you. lol
After the second set of hashes, after “content =” is the cleaned tweet: is eardrum fifti like what i hear lol
As can be seen we have removed the punctuation, changed the case, removed words such as “on” and “be” and changed “fifty” to the stemmed “fifti.”
## The National Immigration Law Center, an immigrant advocacy group, says mandating its use nationwide would force millions of people to correct false information in the system or lose their jobs. It would put nearly 800,000 people out of work because of database errors. What's more, E-Verify cannot detect identity fraud.
## list(content = "the nation immigr law center immigr advocaci group say mandat use nationwid forc million peopl correct fals inform system lose job it put near peopl work databas error what everifi detect ident fraud", meta = list(author = character(0), datetimestamp = list(sec = 19.4614140987396, min = 25, hour = 23, mday = 19, mon = 2, year = 116, wday = 6, yday = 78, isdst = 0), description = character(0), heading = character(0), id = "100", language = "en", origin = character(0)))
This is an example from the news file. Note that the number 800,000 does not occur in the cleaned version.
Visualizing our information can help us understand it better, especially something as unwieldy as a chunk of text. From the corpus we create the term-document matrix, which is what it sounds like: terms are rows, document ids are columns, and the entries are counts of those terms appearing in the documents.
Such matrices are also likely to be highly sparse, so we take some steps to reduce them to something more usable.
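A minimal sketch of the matrix step, again using the twitter corpus as the example; the sparsity threshold is illustrative.

```r
library(tm)

# term-document matrix from the cleaned corpus
tdm <- TermDocumentMatrix(twit.corp)

# drop terms that hardly ever appear, then rank overall term frequencies
tdm2 <- removeSparseTerms(tdm, sparse = 0.999)
freq <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)
head(freq, 20)  # the most common (stemmed) terms
```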
[Word clouds: Twitter | Blog | News (excluding "said")]
The word cloud gives a quick visual sense of relative word frequency, typically through size and color. While trying to reduce the upper and lower margins of the chart we found discussion of the shortcomings of word clouds (see, for example, the references). Exploratory analysis is about looking at our data, and visualizations are a great way to do that.
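The clouds shown above can be drawn from the frequency vector just computed; a hedged sketch with the wordcloud package follows (frequency cutoff, word limit and palette are illustrative).

```r
library(wordcloud)
library(RColorBrewer)  # palettes; a dependency of wordcloud

set.seed(1234)  # word placement is random, so fix the layout
wordcloud(words = names(freq), freq = freq,
          min.freq = 25, max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```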
Here the goal was just to get a sense of the common words in each source, to literally see which stand out as the same or different. "Will", "just", "time", "peopl" and "like" appear in all three. From twitter we note that in this sample a positive sentiment stands out ("love", "thank" and "good"), though in some of our exploration this wasn't the case. Many words appear frequently in two of the three sources, e.g. "year" and "work". Finally, it was notable that "citi" appears near the top for the news source, and the words "play" and "game" were unexpected until we remembered the sports news.
Frankly, a table of words by source might convey things better; for one, we would have the counts (recall that each source does not contribute equally to the sample). The link given above shows an interesting modification to the word cloud.
It should be noted that we don't know whether the frequent occurrence of "like" is due to the existence of the "like" button, though it seems likely. In addition, we need to check the news source for author, date and other tagline information that should be accounted for.
Above we created unigrams, collections of single words. Now we create bigrams, collections of word pairs. For the next step in the project we will need to create trigrams and then use the n-grams we have built to construct our initial prediction algorithm.
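A hedged sketch of the bigram step using RWeka (installed in the appendix); depending on the tm version, a VCorpus may be needed for a custom tokenizer to take effect.

```r
library(RWeka)
library(tm)

# tokenizer that emits two-word terms for the term-document matrix
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

tdm.bi  <- TermDocumentMatrix(twit.corp, control = list(tokenize = BigramTokenizer))
bi.freq <- sort(rowSums(as.matrix(tdm.bi)), decreasing = TRUE)
head(bi.freq, 10)  # most frequent word pairs
```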
The example words above highlight issues that arise when working with language (versus raw text). Is "will" part of a legal document, or is it "I will get the milk"? Does the news tell us "you have to play the game to get ahead" or "that was the play of the game!"?
[Top bigrams: Twitter | Blog | News]
Although we took some steps to clean out "i ve" and most of the "said i" and "i said" type pairs, we still have a number of terms that seem redundant or of little value. In hindsight, leaving in certain punctuation marks, or doing more word transformations so that "I've" becomes "I have", might be in order.
The occurrence of the pronoun "I" is no surprise in tweets and blog posts. In our news corpus we are still plagued by words that most likely come from sources quoted in the stories. Really, we just wanted some interesting word pairs to look at, as shown by printing the last 10 pairs from each source below.
## will see will take year old great weekend haha i
## 23 23 23 22 22
## i bet just watch much love whi i will never
## 22 22 22 22 22
## s like say someth the fact the stori this will wait see
## 24 24 24 24 24 24
## work hard world i yesterday i you see
## 24 24 24 24
## obama administr pretti good the next two day
## 57 57 57 57
## unemploy rate want make offic said said im
## 57 57 56 56
## said today tri make
## 56 56
Working with a body of text is different from working with the usual rows-and-columns data. Fortunately, for our task we can turn the text into structures we understand: n-grams and document-term or term-document matrices. The next step is to combine the data from our sources and build a word prediction model.
Some of our results so far leave a little to be desired, but most were obtained in the course of putting together this report. We certainly have a better understanding of what we have to work with and of the steps needed to clean and "tokenize" the data into a form from which we can build a model that appears to understand phrases.
In order to meet the standards and due date for the project we plan to:

1. Build trigrams to go with the unigrams and bigrams above.
2. Combine the data from the three sources.
3. Build an initial prediction model from the n-grams.
4. Work through the memory-use and data-size limitations and choose among the candidate R packages for modeling.
5. Package the model as a data product.

We hope to get through the first three of these before the end of week three, in keeping with the syllabus. Getting something in place and refining it is better than trying to create something perfect.
#####################################################
## Data Science Capstone: Milestone Report
## milefuncts.R
## Jesse Sharp
## 2016-03-19
## functions used in CDSCmilestone.Rmd
#####################################################
# Read Samples: pt is the source's share of total lines, so we read the first
# 100,000 * pt lines from file x
loadsamp <- function(x, pt) {
  con1 <- file(x, "rb")
  y <- readLines(con1, 100000 * pt, skipNul = TRUE)
  close(con1)  # close the connection
  return(y)
}
# clean text: keep English alphanumerics and standard punctuation
# (the hyphen sits at the end of the character class so it is literal, not a range)
charfixer <- function(x) {
  gsub("[^a-zA-Z0-9',:;.!?-]", " ", x)
}
# expand a few common contractions before corpus cleaning
wordfixer <- function(x) {
  x <- gsub("He'll", "he will", x)
  x <- gsub("he'll", "he will", x)
  x <- gsub("I'll", "i will", x)
  x <- gsub("i'll", "i will", x)
  x <- gsub("I've", "i have", x)
  x <- gsub("i've", "i have", x)
  x <- gsub("it's", "its", x)  # it is or it's (possessive)?
  return(x)
}
## bad words
conb <- file("C:/StatWare/Rprog/Capstone/data/badwords.txt", "r")
badwords <- readLines(conb, skipNul=TRUE)
close(conb)
## corpus cleaning function
corpCleaner <- function(x){
  # a. remove English stop words (done before lowercasing, so capitalized stop
  #    words such as "Is" or "What" survive -- visible in the examples above)
  x <- tm_map(x, removeWords, stopwords("english"))
  # b. lowercase
  x <- tm_map(x, content_transformer(tolower))
  # c. strip punctuation, numbers and extra whitespace; drop profanity
  x <- tm_map(x, removePunctuation)
  x <- tm_map(x, removeNumbers)
  x <- tm_map(x, stripWhitespace)
  x <- tm_map(x, removeWords, badwords)
  # d. stemming - optional
  x <- tm_map(x, stemDocument)
  return(x)
}
# Printing samples: show the zth raw line of x next to the zth cleaned corpus entry of y
prntdoc <- function(x, y, z) {
  RAWTXT <- x[z]
  corp1 <- as.character(y$content[z])
  # return(print(rbind(RAWTXT, corp1)))
  return(writeLines(rbind(RAWTXT, corp1)))
}
# Multi-plot function
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
library(grid)
# Make a list from the ... arguments and plotlist
plots <- c(list(...), plotlist)
numPlots = length(plots)
# If layout is NULL, then use 'cols' to determine layout
if (is.null(layout)) {
# Make the panel
# ncol: Number of columns of plots
# nrow: Number of rows needed, calculated from # of cols
layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}
if (numPlots==1) {
print(plots[[1]])
} else {
# Set up the page
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
# Make each plot, in the correct location
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col))
}
}
}
# Cleveland dot plot of top bigrams; credit: https://rpubs.com/escott8908/RGC_Ch3_Gar_Graphs
# (z: a data frame with columns bigram and counts)
plotTop <- function(z) {
  x <- ggplot(z, aes(x = log(counts), y = reorder(bigram, counts))) +
    geom_point(size = 3) +
    theme_bw() +
    theme(panel.grid.major.x = element_blank(),
          panel.grid.minor.x = element_blank(),
          panel.grid.major.y = element_line(color = 'grey60', linetype = 'dashed'))
  return(x)
}
#####################################################
## Data Science Capstone: Milestone Report
## appendix.R
## Jesse Sharp
## 2016-03-19
## data reading/sampling used in CDSCmilestone.Rmd
#####################################################
# Appendix R Code
# packages we have installed so far
install.packages("tm")
install.packages("wordcloud")
install.packages("SnowballC")
install.packages("rJava")
install.packages("RWeka")
# Task: confirm ability to read the data
datURL <- ("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-Swiftkey.zip")
datFile <- ("../Coursera-Swiftkey.zip")
getwd()
setwd("C:/StatWare/Rprog/Capstone")
#datFile1 <- ("./data/filename")
if (!file.exists("./data")) {
dir.create("./data")
}
if (!file.exists(datFile)) {
download.file(datURL, destfile=datFile)
unzip(datFile)
}
# Read in all the data - use "rb" - binary is a good default even for text
# (instead of being more complete and writing a function to sample the data as it is
# read for better resource use)
con1 <- file("C:/Statware/Rprog/Capstone/data/en_US.twitter.txt", "rb")
twit.dat <- readLines(con1, skipNul=TRUE)
close(con1) ## close the connection
length(twit.dat)
con2 <- file("C:/Statware/Rprog/Capstone/data/en_US.blogs.txt", "rb")
blog.dat <- readLines(con2, skipNul=TRUE)
close(con2) ## close the connection
length(blog.dat)
con3 <- file("C:/Statware/Rprog/Capstone/data/en_US.news.txt", "rb")
news.dat <- readLines(con3, skipNul=TRUE)
close(con3) ## close the connection
length(news.dat)
# Create training sample and write to file
# combine then sample or sample then combine? discuss!
# we are taking a 10% sample and sampling each set proportionally to its
# contribution by line count to the total body of text
# (line.prop and line.count are built earlier in the report by hidden code:
#  per-source line proportions in blog/news/twitter order and formatted line counts)
#line.prop <- .2 # TESTING
set.seed(3456)
line.tot <- as.numeric(gsub('[^0-9]', '', line.count[4]))  # total lines (denominator)
twit.train <- sample(twit.dat,size=round((0.1)*line.prop[3]*line.tot))
blog.train <- sample(blog.dat,size=round((0.1)*line.prop[1]*line.tot))
news.train <- sample(news.dat,size=round((0.1)*line.prop[2]*line.tot))
## Save samples
writeLines(twit.train, "./data/twit.train.txt")
writeLines(blog.train, "./data/blog.train.txt")
writeLines(news.train, "./data/news.train.txt")
# Clean up
1 - (memory.size()/memory.limit()) # avail memory as %
rm(list=c("twit.dat","blog.dat","news.dat"))
gc()
1 - (memory.size()/memory.limit()) # avail memory as %
print(sessionInfo(), locale=FALSE)
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.12.3
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 formatR_1.3 tools_3.2.3 htmltools_0.3
## [5] yaml_2.1.13 stringi_1.0-1 rmarkdown_0.9.5 highr_0.5.1
## [9] stringr_1.0.0 digest_0.6.9 evaluate_0.8.3