Synopsis

Getting and Cleaning the Data

Data for the project was provided by Coursera in partnership with SwiftKey. The data comes from a corpus called HC Corpora.

The data set was downloaded from the course website and can be found at this link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The data set contains four different language sets: English, Russian, Finnish, and German. My analysis is focused on the English set only.

Dataset Summary

The en_US folder contains three files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.

Since all three files are very large, I took a small random subset of 100 records from each file for this initial analysis.
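The sampling step itself is not shown above; a minimal sketch of how such a subset could be drawn (the raw-file location, the seed, and the output file names below are my assumptions, not the original code) is:

set.seed(1234)
dir.create("./sample", showWarnings = FALSE)
for (f in c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")) {
    ## read the raw file, draw 100 random lines, and write them to ./sample/
    ## (paths and seed are assumptions)
    lines <- readLines(file.path("./final/en_US", f), encoding = "UTF-8", skipNul = TRUE)
    writeLines(sample(lines, 100), file.path("./sample", sub("en_US\\.", "", f)))
}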

Building the Corpus

A corpus is the format in which text is typically stored for text-mining processes. To start my text-mining process, I used the tm package to create a Corpus object from the three sample files (blogs.txt, news.txt, and twitter.txt).

library(tm)
docs <- Corpus(DirSource("./sample/"))
summary(docs)
##             Length Class             Mode
## blogs.txt   2      PlainTextDocument list
## news.txt    2      PlainTextDocument list
## twitter.txt 2      PlainTextDocument list

Cleaning the Corpus

Next, I performed some pre-processing to clean up and prepare the raw text for analysis. The transformations were applied using the tm package's built-in transformation functions.
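The individual transformations are not listed above; as a sketch, a typical tm clean-up of this kind (the exact set applied here is an assumption) looks like:

## typical tm clean-up steps (assumed; the exact transformations used are not shown above)
docs <- tm_map(docs, content_transformer(tolower))  ## lower-case all text
docs <- tm_map(docs, removePunctuation)             ## drop punctuation
docs <- tm_map(docs, removeNumbers)                 ## drop digits
docs <- tm_map(docs, stripWhitespace)               ## collapse repeated whitespace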

Tokenization and Building Document Term Matrices

The next step in text mining is tokenization (generating n-grams). This step involves breaking down the text into meaningful units such as words and phrases.

Using the tm and RWeka packages, the corpus was tokenized and document-term matrices (DTMs) were created.

library(RWeka)

## create unigrams
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
unidtm <- DocumentTermMatrix(docs, control = list(tokenize = UnigramTokenizer))

## create bi-grams
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bidtm <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer))

## create tri-grams
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tridtm <- DocumentTermMatrix(docs, control = list(tokenize = TrigramTokenizer))

Inspecting the DTM

The DTM of my sample data contains 2,933 terms (distinct words).

Taking a peek at the first few terms in the DTM:

inspect(unidtm[1:3, 1:6])
## <<DocumentTermMatrix (documents: 3, terms: 6)>>
## Non-/sparse entries: 7/11
## Sparsity           : 61%
## Maximal term length: 9
## Weighting          : term frequency (tf)
## Sample             :
##              Terms
## Docs          aaa able academy accents according accounts
##   blogs.txt     1    7       5       1         1        1
##   news.txt      0    3       0       0         0        0
##   twitter.txt   0    0       0       0         0        0

Exploratory Data Analysis

For further exploration of the data, I decided to leverage familiar tools such as the dplyr package. To do that, I converted the DTM into a tidy data frame (one term per document per row) using the tidytext package, which provides functions to convert between a DTM and a tidy data frame.

library(tidytext)
library(dplyr)

## convert the DTM to a tidy data frame
unidtm_df <- tidy(unidtm)
head(unidtm_df)
## # A tibble: 6 × 3
##    document      term count
##       <chr>     <chr> <dbl>
## 1 blogs.txt       aaa     1
## 2 blogs.txt      able     7
## 3 blogs.txt   academy     5
## 4 blogs.txt   accents     1
## 5 blogs.txt according     1
## 6 blogs.txt  accounts     1

## summarize to get word frequencies
wordfreq_df <- unidtm_df %>% count(term, wt = count, sort = TRUE)

The plots below summarize my analysis of word frequencies.
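The plots themselves are built from wordfreq_df; a minimal sketch of one such frequency plot, assuming ggplot2 (the styling and the cutoff of 20 terms are my choices, not necessarily those used in the report), could be:

library(ggplot2)

## bar chart of the 20 most frequent unigrams (an assumed sketch, not the original plotting code)
top20 <- head(wordfreq_df, 20)
ggplot(top20, aes(x = reorder(term, n), y = n)) +
    geom_col() +
    coord_flip() +
    labs(x = "Term", y = "Frequency", title = "Top 20 unigrams in the sample")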

Some observations

Plans for further development