Abstract

The goal of this project is to download, load, and analyze several sets of text data gathered from blogs, news sites, and Twitter.

Load packages needed to analyze data

In order to analyze the data properly, it will need to be cleaned and adjusted for usability. The tm (text mining) package provides the text-mining tools, stringi provides basic string statistics for the text, and ggplot2 is used to visualize the results of the analysis.

library(tm)  # text mining package
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.2.3
library(stringi) # character string analysis
## Warning: package 'stringi' was built under R version 3.2.5
library(ggplot2) # visualization package
## Warning: package 'ggplot2' was built under R version 3.2.4
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate

Download and Unzip data

The data is provided as a ZIP-compressed archive, which is available here

if(!file.exists("Coursera-SwiftKey.zip")){
    #Download the dataset
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                  "Coursera-SwiftKey.zip")
    #Unzip the dataset
    unzip("Coursera-SwiftKey.zip")
}else{
    print("Dataset is already downloaded")
}
## [1] "Dataset is already downloaded"

Preprocess data

Each of the data sets needs to be loaded into R for analysis.

# blog text
blogs <- readLines("en_US.blogs.txt")
# twitter text
twitter <- readLines("en_US.twitter.txt")
## Warning in readLines("en_US.twitter.txt"): line 167155 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 268547 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1274086 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1759032 appears to contain
## an embedded nul

The file en_US.news.txt contains SUB control characters which cannot be read directly by tm or by readLines() from base R. This is resolved by opening the file as a binary connection, which handles the SUB control characters.

# news text (read through a binary connection to handle the SUB control characters)
binary.news <- file("en_US.news.txt", open = "rb")
news <- readLines(binary.news, encoding = "UTF-8")
close(binary.news)

Exploratory Analysis

Before beginning any exploratory analysis, I want to get a good understanding of the data. Using stringi, I will summarize the number of lines and characters and the word count per line for each data set, and visualize the word-count distributions with qplot.

Blog text

# summary of lines and characters
stri_stats_general(blogs)
##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   208361438   171926076
# summary of word count per line
summary(stri_count_words(blogs))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   29.00   42.43   61.00 6726.00
#qplot histogram of word count per line
qplot(stri_count_words(blogs))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

News text

# summary of lines and characters
stri_stats_general(news)
##       Lines LinesNEmpty       Chars CharsNWhite 
##     1010242     1010242   203223154   169860866
# summary of word count per line
summary(stri_count_words(news))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.41   46.00 1796.00
#qplot histogram of word count per line
qplot(stri_count_words(news))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Twitter text

# summary of lines and characters
stri_stats_general(twitter)
##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162384825   134370864
# summary of word count per line
summary(stri_count_words(twitter))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     7.0    12.0    12.8    18.0    60.0
#qplot histogram of word count per line
qplot(stri_count_words(twitter))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The blog and news data sets are more similar to each other than to the Twitter data set, likely due to Twitter's 140-character limit per tweet.
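As a quick side-by-side check of this impression, the per-line word-count summaries can be combined into one table (a small convenience snippet using the objects already loaded above):

# compare the word-count-per-line summaries across the three sources
sapply(list(blogs = blogs, news = news, twitter = twitter),
       function(x) summary(stri_count_words(x)))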

Next I want to check the size of each file.

# file size in MB
file.info("en_US.blogs.txt")$size   / 1024^2
## [1] 200.4242
file.info("en_US.news.txt")$size    / 1024^2
## [1] 196.2775
file.info("en_US.twitter.txt")$size / 1024^2
## [1] 159.3641

At approximately 200 MB each, the files are too large to analyze in full. Next I will create smaller random samples (10% of each data set) to allow for exploratory analysis.

# blog sample
blog.sample <- sample(blogs, length(blogs)*0.1)
# news sample
news.sample <- sample(news, length(news)*0.1)
# twitter sample
twitter.sample <- sample(twitter, length(twitter)*0.1)

Using the tm package, I will create a corpus of the sample sets. The samples are first written to disk so they can be read back in together with DirSource.

# create the samples directory if it does not already exist
if(!dir.exists("samples")) dir.create("samples")
# blogs
write.table(blog.sample, "samples/sample.blog.txt", row.names = FALSE, col.names = FALSE)
# news
write.table(news.sample, "samples/sample.news.txt", row.names = FALSE, col.names = FALSE)
# twitter
write.table(twitter.sample, "samples/sample.twitter.txt", row.names = FALSE, col.names = FALSE)
# combine tables into single sample corpus
corpus.sample <- Corpus(DirSource(directory = "samples/"))

Removing unused environment variables to free memory.

rm(binary.news)
rm(blog.sample)
rm(blogs)
rm(news)
rm(news.sample)
rm(twitter)
rm(twitter.sample)

The goal of the project is to predict the most likely next word in a sentence as it is being entered by a user. To do this, I will want to find the words that most frequently follow any given word, regardless of inflectional form, case, or punctuation. I will also remove the most common words (e.g. conjunctions and other stop words) in order to better understand which non-common words are used most frequently.

Starting with changing all characters to lower case.

# change case to lower
corpus.sample <- tm_map(corpus.sample, content_transformer(tolower))

The project also requires removal of foul language. I use a list from bannedwordlist.com, which I have already downloaded as a text file (swearWords.txt).

foul.language <- readLines("swearWords.txt")
## Warning in readLines("swearWords.txt"): incomplete final line found on
## 'swearWords.txt'
# remove foul words
corpus.sample <- tm_map(corpus.sample, removeWords, foul.language)

Continuing to clean the texts by removing numbers, extra white space, punctuation, and English stop words.

# remove numbers
corpus.sample <- tm_map(corpus.sample, removeNumbers)
# remove white spaces
corpus.sample <- tm_map(corpus.sample, stripWhitespace)
# remove punctuation
corpus.sample <- tm_map(corpus.sample, removePunctuation)
# remove common words
corpus.sample <- tm_map(corpus.sample, removeWords, stopwords("english"))

Remove unused environment variables.

rm(foul.language)

With the corpus of texts cleaned, I next review the frequency of words. This is achieved by converting the corpus into a term-document matrix using the tm package.

# create term document matrix from corpus sample
corpus.tdm <- TermDocumentMatrix(corpus.sample)
# view summary statistics of term document matrix
corpus.tdm
## <<TermDocumentMatrix (terms: 237484, documents: 3)>>
## Non-/sparse entries: 332285/380167
## Sparsity           : 53%
## Maximal term length: 598
## Weighting          : term frequency (tf)
# inspect the first 30 terms for each document of the corpus
inspect(corpus.tdm[1:30,1:3])
## <<TermDocumentMatrix (terms: 30, documents: 3)>>
## Non-/sparse entries: 30/60
## Sparsity           : 67%
## Maximal term length: 12
## Weighting          : term frequency (tf)
## 
##                  Docs
## Terms             sample.blog.txt sample.news.txt sample.twitter.txt
##   \037decorative                0               1                  0
##   \037flavor                    0               1                  0
##   \037flavors                   0               1                  0
##   \037tenderizing               0               1                  0
##   <U+0096><U+0096><U+0096>                           0               1                  0
##   <U+0096>adapting                     0               1                  0
##   <U+0096>although                     0               1                  0
##   <U+0096>aug                          0               1                  0
##   <U+0096>clean                        0               1                  0
##   <U+0096>examination                  0               1                  0
##   <U+0096>including                    0               1                  0
##   <U+0096>lisa                         0               1                  0
##   <U+0096>make                         0               1                  0
##   <U+0096>michael                      0               1                  0
##   <U+0096>outlining                    0               1                  0
##   <U+0096>ran                          0               1                  0
##   <U+0096>stan                         0               1                  0
##   <U+0096>struck                       0               2                  0
##   <U+0097>anthony                      0               1                  0
##   <U+0097>baseball                     0               1                  0
##   <U+0097>blind                        0               1                  0
##   <U+0097>change                       0               1                  0
##   <U+0097>dont                         0               1                  0
##   <U+0097>dover                        0               1                  0
##   <U+0097>egypt                        0               1                  0
##   <U+0097>either                       0               1                  0
##   <U+0097>end                          0               1                  0
##   <U+0097>europes                      0               1                  0
##   <U+0097>even                         0               1                  0
##   <U+0097>expanding                    0               1                  0

With the term-document matrix of the corpus, I next count the words and sort them by frequency.

# total frequency of each word across all documents, sorted in decreasing order
freq <- sort(rowSums(as.matrix(corpus.tdm)), decreasing = TRUE)
# create data frame of frequencies
wf <- data.frame(word=names(freq), freq=freq)

Finally, using ggplot2, I visualize the words occurring more than 10,000 times across the sampled documents. The 10,000 cutoff was chosen arbitrarily as a starting point to get a sense of word frequency.

ggplot(wf[wf$freq>10000, ], aes(x=word, y=freq)) +
  geom_bar(stat="identity") + 
  theme(axis.text.x=element_text(angle = 60, hjust=1))

Conclusion

The blog and news data are similar in composition, while the Twitter data differs, likely because of Twitter's 140-character limit per message. There is still a significant number of special non-ASCII characters that need to be removed. This may be accomplished by removing the sparse terms of the term-document matrix or by manually removing the special characters. Moving forward, I will be looking into N-grams to capture the relationship between words and their order as a predictive measure of what to suggest to users of the project.
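As a rough sketch of these next steps, the following could be a starting point (it assumes the corpus.sample and corpus.tdm objects from above are still in memory; the sparsity threshold and the bigram.counts helper are illustrative choices rather than final ones):

# drop very sparse terms; with three documents this keeps only terms that
# appear in at least two of them, which removes most one-off garbled tokens
corpus.tdm.common <- removeSparseTerms(corpus.tdm, sparse = 0.4)
# alternatively, strip non-ASCII characters from the samples before
# rebuilding the term document matrix
strip.non.ascii <- function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = "")
corpus.sample <- tm_map(corpus.sample, content_transformer(strip.non.ascii))
# first look at bigrams: count which word most often follows a given word
bigram.counts <- function(lines) {
    words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
    words <- words[words != ""]
    bigrams <- paste(head(words, -1), tail(words, -1))
    sort(table(bigrams), decreasing = TRUE)
}
# e.g. the ten most frequent bigrams in the blog sample
head(bigram.counts(content(corpus.sample[[1]])), 10)

Rebuilding the term-document matrix from the cleaned samples and extending the helper to trigrams would be the natural follow-on.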