Introduction

This is the milestone report for the Exploratory Analysis section of the Coursera Data Science Capstone project. The goal of the capstone is to analyze a large corpus of text documents to discover the structure in the data and how words are put together. After cleaning and analyzing the text data, I will build a predictive text model.

This report summarizes the findings from the exploratory analysis carried out on the capstone data. It includes the major statistical findings, along with graphs that characterize the word data set.

library(R.utils)
library(stringi)
library(ggplot2)
library(ngram)
library(tm)
library(RWeka)
library(tau)
library(wordcloud)

Data Processing

Loading the Dataset

setwd("/Users/sandraezidiegwu/Documents/Data Science/Capstone Project/final/en_US/")
swiftkey <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(swiftkey, destfile = "Coursera-SwiftKey.zip", method = "curl")
unzip("Coursera-SwiftKey.zip")

# skipNul = TRUE prevents readLines from stopping early on embedded nul characters
blog <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

Sample Data Collection

To reduce processing time, I take a random sample of 1,000 lines from each file.

set.seed(124)
blog.s <- sample(blog, 1000)
news.s <- sample(news, 1000)
twitter.s <- sample(twitter, 1000)

Combine Blog/News/Twitter Word Samples into 1 File

words <- c(blog.s, news.s, twitter.s)
write(words, file = 'words.txt')
words <- readLines("words.txt")

Summary Statistics

##        File Name File Size (mb) Line Count Word Count
## 1          Blogs         248.49     899288   37334131
## 2           News         249.63    1010242   34372530
## 3        Twitter         301.40    2360148   30373543
## 4 Sample Summary           0.66       3000      89476
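
The table above can also be assembled programmatically from the raw and sampled text. The sketch below is one way to do it; the file.summary() helper and the summary.df object are names I introduce here for illustration.

# Hedged sketch: build the summary table with file.info() for sizes and
# stringi::stri_count_words() for word counts.
file.summary <- function(name, path, lines) {
  data.frame("File Name" = name,
             "File Size (mb)" = round(file.info(path)$size / 1024^2, 2),
             "Line Count" = length(lines),
             "Word Count" = sum(stri_count_words(lines)),
             check.names = FALSE)
}

summary.df <- rbind(file.summary("Blogs", "en_US.blogs.txt", blog),
                    file.summary("News", "en_US.news.txt", news),
                    file.summary("Twitter", "en_US.twitter.txt", twitter),
                    file.summary("Sample Summary", "words.txt", words))
summary.df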

Summary Statistics Visualization
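
A simple bar chart gives a quick visual comparison of the three sources. The sketch below assumes the summary.df data frame from the previous sketch; the plot title and axis labels are placeholders.

# Hedged sketch: line counts per source, plotted from summary.df
ggplot(summary.df[1:3, ], aes(x = `File Name`, y = `Line Count`, fill = `File Name`)) +
  geom_col() +
  labs(title = "Line Count by Source", x = "Source", y = "Lines") +
  theme(legend.position = "none")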

Text Corpus and Cleaning

words <- gsub("http.*\\s*", "", words)
words <- gsub("[[:punct:]]", "", words)
words <- gsub("[[:digit:]]", "", words)
words <- gsub("http[[:alnum:]]*", "", words)
words <- gsub("^\\s+|\\s+$", " ", words)

#Convert to Lower Case
words <- tolower(words)

#Remove Stopwords
words <- removeWords(words, stopwords(kind = "en"))

#Remove Profanity
words.profanity <- readLines("profanity.csv")
words <- removeWords(words, words.profanity)

#Create Corpus
words.corpus <- Corpus(VectorSource(words), readerControl = list(reader = readPlain, language = 'en'))
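
A couple of documents can be inspected to confirm the cleaning worked as intended; this is a quick optional check using tm's inspect().

# Quick look at the first two cleaned documents
inspect(words.corpus[1:2])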

Tokenization

Using the ‘tau’ package, I create functions that tokenize the text into unigrams, bigrams, and trigrams.

unigramtoken <- function(x, n = 1) return(rownames(as.data.frame(unclass(textcnt(x$content, method = "string", n = n)))))
bigramtoken  <- function(x, n = 2) return(rownames(as.data.frame(unclass(textcnt(x$content, method = "string", n = n)))))
trigramtoken <- function(x, n = 3) return(rownames(as.data.frame(unclass(textcnt(x$content, method = "string", n = n)))))
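
As a quick sanity check, the tokenizers can be run on a toy document. The list structure below mimics a corpus element, and the example sentence is my own.

# Hypothetical toy document to illustrate the tokenizers
toy <- list(content = "the quick brown fox jumps over the lazy dog")
bigramtoken(toy)   # returns word pairs such as "brown fox" and "lazy dog"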

Top 10 Unigrams/Bigrams/Trigrams

tdm.unigram <- TermDocumentMatrix(words.corpus, control=list(tokenize=unigramtoken))
freq1 <- sort(rowSums(as.matrix(tdm.unigram)), decreasing = T)
freq.df1 <- data.frame(word=names(freq1), freq=freq1)
head(freq.df1, 10)
##      word freq
## said said  242
## will will  231
## one   one  215
## just just  212
## like like  184
## can   can  168
## get   get  152
## time time  141
## new   new  139
## now   now  135
tdm.bigram <- TermDocumentMatrix(words.corpus, control=list(tokenize=bigramtoken))
freq2 <- sort(rowSums(as.matrix(tdm.bigram)), decreasing = T)
freq.df2 <- data.frame(word=names(freq2), freq=freq2)
head(freq.df2, 10)
##                    word freq
## high school high school   18
## new york       new york   16
## im going       im going   14
## last year     last year   14
## right now     right now   13
## st louis       st louis   13
## years ago     years ago   13
## first time   first time   12
## dont know     dont know   11
## dont think   dont think   10
tdm.trigram <- TermDocumentMatrix(words.corpus, control=list(tokenize=trigramtoken))
freq3 <- sort(rowSums(as.matrix(tdm.trigram)), decreasing = T)
freq.df3 <- data.frame(word=names(freq3), freq=freq3)
head(freq.df3, 10)
##                                            word freq
## president barack obama   president barack obama    5
## new york times                   new york times    4
## im hard time                       im hard time    3
## also great way                   also great way    2
## can honestly say               can honestly say    2
## cant imagine youre           cant imagine youre    2
## cant wait share                 cant wait share    2
## columbia university new columbia university new    2
## couple weeks ago               couple weeks ago    2
## couple years ago               couple years ago    2

Word Cloud of the Most Frequent Words in the Dataset

wordcloud(freq.df1$word, freq.df1$freq, scale = c(4,0.8), max.words = 40, random.order = F, random.color = TRUE, colors = brewer.pal(8,'Dark2'))

Findings and Further Work

  • The Twitter line lengths differ markedly from those of the news and blog entries due to the 140-character limit Twitter sets for each post.
  • Data processing is very slow because of the size of the text files, so it was necessary to work with samples.
  • Stopwords were removed here, but they are needed for accurate prediction; in the follow-up work I will keep them in the corpus.
  • The next step is to build a prediction algorithm for the application; a minimal sketch of the idea is shown below.
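
As a rough illustration of that next step (not the final model), the bigram and trigram frequency tables computed above can already drive a naive next-word lookup. The helper below is a hedged sketch that uses the freq.df2 and freq.df3 data frames with a simple backoff from trigrams to bigrams; the function name predict.word is my own.

# Hedged sketch: naive next-word prediction from the n-gram frequency tables.
# Look for trigrams starting with the last two words typed, then back off
# to bigrams starting with the last word. Not the final model.
predict.word <- function(phrase, freq.df2, freq.df3) {
  tokens <- unlist(strsplit(tolower(phrase), "\\s+"))
  if (length(tokens) >= 2) {
    prefix <- paste(tail(tokens, 2), collapse = " ")
    hits <- freq.df3[grepl(paste0("^", prefix, " "), freq.df3$word), ]
    if (nrow(hits) > 0) {
      best <- as.character(hits$word[which.max(hits$freq)])
      return(tail(unlist(strsplit(best, " ")), 1))
    }
  }
  prefix <- tail(tokens, 1)
  hits <- freq.df2[grepl(paste0("^", prefix, " "), freq.df2$word), ]
  if (nrow(hits) > 0) {
    best <- as.character(hits$word[which.max(hits$freq)])
    return(tail(unlist(strsplit(best, " ")), 1))
  }
  NA_character_
}

predict.word("new york", freq.df2, freq.df3)   # likely returns "times" given the trigram counts above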