Abstract

For this project I will be working with three text data sets downloaded from HC Corpora (www.corpora.heliohost.org) that were collected from online news sources, blogs, and Twitter. My goal is to clean the data so that I can use it to build an algorithm that predicts natural language patterns.

Obtaining the data

The data for this project was provided by Johns Hopkins University's Data Science Capstone course on Coursera.

library(tm)        # text-mining framework: Corpus, tm_map, DocumentTermMatrix
library(SnowballC) # stemming backend used by stemDocument

if(!"Coursera-swiftkey.zip" %in% dir()){
        download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                      "Coursera-swiftkey.zip", mode = "wb")
        unzip("Coursera-swiftkey.zip")
}

Having downloaded and unzipped the files we will be working with, the next step is to read them into R. To do this, I will use the “tm” package and store the documents in a corpus. Before converting the data to a corpus, I will sample a collection of lines from each document (roughly 10% of each file) to reduce computational strain.

myDataPath <- "final/en_US/en_US."
myFileNames <- c("twitter.txt", "news.txt", "blogs.txt")
set.seed(117)
# Read each file, keep a random 10% of its lines, and wrap the sample in a Corpus.
# sapply() names the result after the file names, so Corpora_List$twitter.txt works later.
Corpora_List <- sapply(myFileNames, function(NAME){
        con <- file(paste0(myDataPath, NAME))
        temp_Vector <- readLines(con, skipNul = TRUE) # skipNul avoids embedded-nul warnings
        close(con)
        n_Lines <- sample(1:length(temp_Vector), as.integer(length(temp_Vector)/10))
        Corpus(VectorSource(temp_Vector[n_Lines]))
})

Data exploration

Before I begin processing the data I will look at the raw data and get a feel for how it is structured.
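
The chunk that produced the summary below is not shown, so the code here is only a reconstruction sketch: it assumes the line and word counts come from the full raw files, and that the frequent-term list (every word appearing more than 10,000 times) comes from document-term matrices of the sampled corpora.

# Assumed reconstruction, not the original chunk: line and word counts per raw file
raw_Stats <- t(sapply(myFileNames, function(NAME){
        temp_Vector <- readLines(paste0(myDataPath, NAME), skipNul = TRUE)
        c(Number.of.Lines = length(temp_Vector),
          Number.of.Words = sum(lengths(strsplit(temp_Vector, "\\s+"))))
}))
raw_Stats
# Terms that occur more than 10,000 times in the sampled corpora
sort(unique(unlist(lapply(Corpora_List, function(CORP){
        findFreqTerms(DocumentTermMatrix(CORP), lowfreq = 10000)
}))))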

##         Number.of.Lines Number.of.Words
## Twitter         2360148        30373543
## News              77259         2643969
## Blogs            899288        37334131
##  [1] "and"  "are"  "but"  "for"  "from" "have" "his"  "said" "that" "the" 
## [11] "was"  "with"

The table shows the number of lines and words in each file, and beneath it is a list of the words that appear more than 10,000 times. As you can see, many of these words tell us little to nothing about the content of the files, so we will have to preprocess the data to make it more useful.

Preprocessing

To start, I will transform the data to get rid of capital letters, punctuation, and numbers. Then I will remove English stop words such as “the”, “and”, and “to”, along with profanity. Finally, I will stem the words.

# Collect some information about the sampled data before preprocessing
start_DTM <- sapply(Corpora_List, DocumentTermMatrix)
start_Snapshot <- Corpora_List$twitter.txt[1:5]
# Change case
Corpora_List <- sapply(myFileNames, function(NAME){tm_map(Corpora_List[[NAME]], content_transformer(tolower))})
# Delete punctuation
Corpora_List <- sapply(myFileNames, function(NAME){tm_map(Corpora_List[[NAME]], removePunctuation)})
# Delete numbers
Corpora_List <- sapply(myFileNames, function(NAME){tm_map(Corpora_List[[NAME]], removeNumbers)})
# Remove stop words
Corpora_List <- sapply(myFileNames, function(NAME){tm_map(Corpora_List[[NAME]], removeWords, stopwords(kind = "en"))})
# Remove profanity
if(!"bad-words.txt" %in% dir()){
        download.file("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt", "bad-words.txt")
}
Corpora_List <- sapply(myFileNames, function(NAME){tm_map(Corpora_List[[NAME]], removeWords, "bad-words.txt")})
# White space removal
Corpora_List <- sapply(myFileNames, function(NAME){tm_map(Corpora_List[[NAME]], stripWhitespace)})
# Stemming
#dict_Corpora <- Corpora_List
Corpora_List <- sapply(myFileNames, function(NAME){tm_map(Corpora_List[[NAME]], stemDocument)})
#Corpora_List <- sapply(myFileNames, function(NAME){tm_map(Corpora_List[[NAME]], stemCompletion, dictionary = dict_Corpora[[NAME]])})
# Recollect data after transformations
#Corpora_List <- sapply(myFileNames, function(NAME){tm_map(Corpora_List[[NAME]], PlainTextDocument)})
end_DTM <- sapply(Corpora_List, DocumentTermMatrix)
end_Snapshot <- Corpora_List$twitter.txt[1:5]

By comparing the document-term matrices from before and after the processing, we can see that the number of documents (‘nrow’) remained the same but the number of terms (‘ncol’) decreased.

start_DTM
##          twitter.txt     news.txt       blogs.txt      
## i        Integer,2271136 Integer,193926 Integer,2466186
## j        Integer,2271136 Integer,193926 Integer,2466186
## v        Numeric,2271136 Numeric,193926 Numeric,2466186
## nrow     236014          7725           89928          
## ncol     221707          42842          217830         
## dimnames List,2          List,2         List,2
end_DTM
##          twitter.txt     news.txt       blogs.txt      
## i        Integer,1589970 Integer,138367 Integer,1694495
## j        Integer,1589970 Integer,138367 Integer,1694495
## v        Numeric,1589970 Numeric,138367 Numeric,1694495
## nrow     236014          7725           89928          
## ncol     89611           19279          90492          
## dimnames List,2          List,2         List,2

And by looking at the first few lines from each source, we can see how the text has been reformatted and changed.

lapply(start_Snapshot, as.character)
## $`1`
## [1] "So I now have a team with Ellsbury, Gardner, Morse, Jennings, and Kemp on the DL. Brent Lillibridge is now my starting CF."
## 
## $`2`
## [1] "you're my gurrllll. Love ya boo."
## 
## $`3`
## [1] "love u ladies!"
## 
## $`4`
## [1] "Help spread the word! Each time someone tweets #beatcancer 2day $.05 will be donated 2 charities & !"
## 
## $`5`
## [1] "bahahaha. i love you. that's perfect."
lapply(end_Snapshot, as.character)
## $`1`
## [1] " now team ellsburi gardner mors jen kemp dl brent lillibridg now start cf"
## 
## $`2`
## [1] "your gurrllll love ya boo"
## 
## $`3`
## [1] "love u ladi"
## 
## $`4`
## [1] "help spread word time someon tweet beatcanc day will donat chariti"
## 
## $`5`
## [1] "bahahaha love that perfect"

Prediction

For my prediction algorithm I plan to use an n-gram model that looks at up to three words of input, to limit computational strain. I have not started modeling yet, and I plan to adjust the specifics of the model as I go in order to control how much strain the approach puts on the system. Given an input phrase, I will run lookups against the last n-1, n-2, …, 1 words, generate lists of candidate next words from each, and cross-reference them to pick the prediction with the highest probability. A rough sketch of this back-off lookup appears below.
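
The sketch below is only an illustration of the back-off idea, not code from the project: nothing has been modeled yet, and predict_Next, ngram_Tables, and the table layout (one data frame per n-gram order with prefix, word, and count columns) are assumed names and structures.

# Hypothetical sketch of the planned back-off lookup (names and data layout are assumptions).
# ngram_Tables: a list of data frames, one per n-gram order, with columns prefix, word, count.
predict_Next <- function(input_Words, ngram_Tables, n_Max = 3){
        input_Words <- tail(input_Words, n_Max - 1)
        # Try the longest available prefix first, then back off to shorter prefixes
        for(n in rev(seq_len(length(input_Words)))){
                prefix <- paste(tail(input_Words, n), collapse = " ")
                candidates <- ngram_Tables[[n + 1]]
                candidates <- candidates[candidates$prefix == prefix, ]
                if(nrow(candidates) > 0){
                        # Return the most frequent continuations for this prefix
                        return(head(candidates$word[order(-candidates$count)], 3))
                }
        }
        # If no prefix matches, fall back to the most common single words
        head(ngram_Tables[[1]]$word[order(-ngram_Tables[[1]]$count)], 3)
}

For example, with such tables built from the cleaned corpora, predict_Next(c("love", "you"), ngram_Tables) would first look for continuations of “love you” and only fall back to “you” alone if no trigram matched.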