Data for the project was provided by Coursera in partnership with SwiftKey. The data comes from a corpus called HC Corpora.
The data set was downloaded from the course website: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The data set contains four different language sets: English, Russian, Finnish, and German. My analysis focuses on the English set only.
Dataset Summary
The en_US folder contains three files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.
Since all three files are very large, I took a small random subset of 100 records from each file for my initial analysis.
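A rough sketch of how such a subset could be drawn is shown below; the input file paths, the ./sample/ output folder, and the fixed seed are illustrative assumptions rather than the exact code used.

## sketch: draw a random sample of 100 lines from each raw file (paths assumed)
set.seed(1234)
sample_file <- function(infile, outfile, n = 100) {
  lines <- readLines(infile, encoding = "UTF-8", skipNul = TRUE)
  writeLines(sample(lines, n), con = outfile)
}
sample_file("./final/en_US/en_US.blogs.txt", "./sample/blogs.txt")
sample_file("./final/en_US/en_US.news.txt", "./sample/news.txt")
sample_file("./final/en_US/en_US.twitter.txt", "./sample/twitter.txt")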
Building the Corpus
A corpus is the format in which text is typically stored for text mining. To start my text-mining process, I used the tm package to create a Corpus object from the three sample files (twitter.txt, blogs.txt, and news.txt).
library(tm)

## build a corpus from the plain-text files in the sample folder
docs <- Corpus(DirSource("./sample/"))
summary(docs)
## Length Class Mode
## blogs.txt 2 PlainTextDocument list
## news.txt 2 PlainTextDocument list
## twitter.txt 2 PlainTextDocument list
Cleaning the Corpus
Next, I performed some pre-processing to clean up and prepare the raw text for analysis, using transformations from the tm package.
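The exact set of steps is not reproduced in this report; as a minimal sketch, typical tm clean-up transformations (the specific choices here are assumptions) look like this:

## sketch of common tm clean-up steps (specific choices assumed)
docs <- tm_map(docs, content_transformer(tolower))  # convert text to lower case
docs <- tm_map(docs, removePunctuation)             # remove punctuation
docs <- tm_map(docs, removeNumbers)                 # remove digits
docs <- tm_map(docs, stripWhitespace)               # collapse extra whitespace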
Tokenization and Building Document Term Matrices
The next step in text mining is tokenization (generating n-grams). This step involves breaking down the text into meaningful units such as words and phrases.
Using the tm and RWeka packages, I tokenized the corpus and created document-term matrices (DTMs).
library(RWeka)

## create unigrams
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
unidtm <- DocumentTermMatrix(docs, control = list(tokenize = UnigramTokenizer))
## create bi-grams
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bidtm <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer))
## create tri-grams
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tridtm <- DocumentTermMatrix(docs, control = list(tokenize = TrigramTokenizer))
Inspecting the DTM
The unigram DTM of my sample data contains 2933 terms (distinct words).
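That count can be checked from the dimensions of the unigram DTM (a quick check, not output from the original run):

## documents x terms; the second value is the number of distinct words
dim(unidtm)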
Taking a peek at the first few terms in the DTM:
inspect(unidtm[1:3, 1:6])
## <<DocumentTermMatrix (documents: 3, terms: 6)>>
## Non-/sparse entries: 7/11
## Sparsity : 61%
## Maximal term length: 9
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs aaa able academy accents according accounts
## blogs.txt 1 7 5 1 1 1
## news.txt 0 3 0 0 0 0
## twitter.txt 0 0 0 0 0 0
For further exploration of the data, I decided to leverage familiar tools such as the dplyr package. To do that, I converted the DTM into a data frame with one row per document-term pair. This was done with the tidytext package, which provides functions to convert between a DTM and a tidy data frame.
library(tidytext)
library(dplyr)

## convert the dtm to a tidy data frame (one row per document-term pair)
unidtm_df <- tidy(unidtm)
head(unidtm_df)
## # A tibble: 6 × 3
## document term count
## <chr> <chr> <dbl>
## 1 blogs.txt aaa 1
## 2 blogs.txt able 7
## 3 blogs.txt academy 5
## 4 blogs.txt accents 1
## 5 blogs.txt according 1
## 6 blogs.txt accounts 1
## summarize to get word frequencies
wordfreq_df <- unidtm_df %>% count(term, wt = count, sort = TRUE)
The plots below summarize my analysis of word frequencies.
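As a sketch of how such a frequency plot could be produced with ggplot2 (the top-20 cutoff is an assumption):

library(ggplot2)

## sketch: bar chart of the 20 most frequent unigrams in the sample (cutoff assumed)
wordfreq_df %>%
  top_n(20, n) %>%
  ggplot(aes(x = reorder(term, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Term", y = "Frequency", title = "Most frequent unigrams in the sample")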
Some observations