Report summary
This is the first milestone report for the capstone project of the JHU Data Science Specialization on Coursera. According to the instructions, the purpose of this report is “just to demonstrate that we’ve gotten used to working with the textual data and that we are on track to create our prediction algorithm.” Hence, it presents an initial exploratory analysis of the data set provided by Coursera and SwiftKey, as well as ideas on how the modelling task could be tackled.
We’ll be working with the English data, which is a subset of a corpus called HC Corpora. The complete data set used in this analysis can be found HERE. It contains data in four languages: English, German, Russian and Finnish. There are three corpora per language, containing text from Twitter, blogs and news feeds.
The code which was used to generate this report can be found in my GitHub repo.
Setting up the working environment and loading the data
# Loading needed packages
library(tm)
library(stringr)
library(qdap)
library(RColorBrewer)
library(SnowballC)
library(wordcloud)
library(RWeka)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(graphics)
Loading the data, i.e. corpora that will be used in the analysis
con1 <- file("data/en_US.blogs.txt", "r")
blogs <- readLines(con1, encoding = "UTF-8", skipNul = TRUE)
close(con1)
con2 <- file("data/en_US.news.txt", "r")
news <- readLines(con2, encoding = "UTF-8", skipNul = TRUE)
close(con2)
con3 <- file("data/en_US.twitter.txt", "r")
twitter <- readLines(con3, encoding = "UTF-8", skipNul = TRUE)
close(con3)
Summary statistics
The table below provides a basic summary of our corpora. Because the data sets are fairly large, we will sample 10000 lines from each corpus and combine the samples for further analysis. This is the approach recommended by the course instructors at JHU.
Corpus | Size (MB) | Lines | Words
Blogs | 200.42 | 899288 | 37334131
News | 196.28 | 77259 | 2643969
Twitter | 159.36 | 2360148 | 30373583
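The summary figures and the sampling step could be computed along the following lines. This is a minimal sketch, assuming the three figures per corpus are file size in MB, number of lines and number of words; `corpus_summary` and `sample_lines` are hypothetical helper names, the seed is arbitrary, and words are counted with a simple whitespace split, which may differ from the counting used for the table.

```r
# Hypothetical helper: line and word counts for a corpus held as a character vector.
# On disk, the size in MB would be file.info("data/en_US.blogs.txt")$size / 1024^2.
corpus_summary <- function(lines, name) {
  data.frame(
    corpus = name,
    lines  = length(lines),
    words  = sum(lengths(strsplit(trimws(lines), "\\s+"))),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: reproducible random sample of at most n lines.
sample_lines <- function(lines, n = 10000, seed = 1234) {
  set.seed(seed)  # fixed seed so the same sample is drawn on every run
  lines[sample.int(length(lines), min(n, length(lines)))]
}

# Toy data standing in for a real corpus
toy_blogs <- c("The quick brown fox", "jumps over", "the lazy dog")
corpus_summary(toy_blogs, "Blogs")
sample_lines(toy_blogs, n = 2)

# With the real data, the three samples would be combined into one vector:
# text_sample <- c(sample_lines(blogs), sample_lines(news), sample_lines(twitter))
```

Sampling 10000 lines from each of the three corpora gives 30000 documents in total.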
Data preprocessing - cleaning the corpus
Before the corpus analysis certain preprocessing steps are usually performed. These include the following:
- Text normalization:
- text lowercasing
- removal of numbers
- removal of punctuation signs
- removal of URLs
- whitespace stripping
- profanity filtering
- Replacement of special characters such as emoticons, special utf-8 characters and control characters
All these operations are carried out with the help of the tm package, which is probably the best-known and most widely used package for text mining in R.
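To make the preprocessing steps concrete, here is a rough base-R sketch of the same kind of normalization. It is not the tm pipeline used for this report: `clean_text` is a hypothetical helper, the profanity list is a placeholder, and the regular expressions are simple approximations.

```r
# Hypothetical base-R cleaner; the report itself relies on tm transformations.
clean_text <- function(x, profanity = character(0)) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = " ")  # replace non-ASCII characters (emoticons etc.)
  x <- tolower(x)                                         # lowercasing
  x <- gsub("(https?://|www\\.)\\S+", " ", x)             # URLs, removed before punctuation is stripped
  x <- gsub("[[:cntrl:]]", " ", x)                        # control characters
  x <- gsub("[[:digit:]]+", " ", x)                       # numbers
  x <- gsub("[[:punct:]]+", " ", x)                       # punctuation
  if (length(profanity) > 0) {
    pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
    x <- gsub(pattern, " ", x)                            # profanity filtering on whole words
  }
  trimws(gsub("\\s+", " ", x))                            # whitespace stripping
}

clean_text("Visit http://example.com NOW!!! It's 2 good :)")
clean_text("Well darn it", profanity = c("darn"))  # "darn" is only a placeholder entry
```

The order matters: URLs have to be removed before punctuation, otherwise fragments such as "http" and "com" would survive as separate tokens.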
Initial corpus exploration - frequent terms
The first step in exploring the frequency distribution of words (terms) in our corpus is to build a term-document matrix. This matrix has terms as rows and documents as columns, and each cell holds the number of times the term occurs in the document.
# Let's build a term-document matrix out of our clean corpus
sample_tdm <- TermDocumentMatrix(clean_sample)
sample_tdm
<<TermDocumentMatrix (terms: 40350, documents: 30000)>>
Non-/sparse entries: 431951/1210068049
Sparsity : 100%
Maximal term length: 64
Weighting : term frequency (tf)
# An easy way to start analyzing the information contained in TDM is to change it into a simple matrix
sample_m <- as.matrix(sample_tdm)
# Let's check what our matrix looks like
dim(sample_m)
[1] 40350 30000
sample_m[10000:10007, 2000:2010]
Docs
Terms 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
dictatori 0 0 0 0 0 0 0 0 0 0 0
dictatorship 0 0 0 0 0 0 0 0 0 0 0
dictatorshiplit 0 0 0 0 0 0 0 0 0 0 0
dictionari 0 0 0 0 0 0 0 0 0 0 0
dictionaryush 0 0 0 0 0 0 0 0 0 0 0
didact 0 0 0 0 0 0 0 0 0 0 0
diddi 0 0 0 0 0 0 0 0 0 0 0
diddley 0 0 0 0 0 0 0 0 0 0 0
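The printed summary and the slice of zeros above both reflect how sparse the matrix is. The arithmetic can be checked with the short sketch below, which uses only the figures printed above; the memory estimate assumes each cell of the dense matrix is stored as an 8-byte double.

```r
n_terms    <- 40350        # rows of the term-document matrix
n_docs     <- 30000        # columns of the term-document matrix
non_sparse <- 431951       # non-zero entries reported by tm
sparse     <- 1210068049   # zero entries reported by tm

total_cells <- n_terms * n_docs               # every term-document pair
stopifnot(total_cells == non_sparse + sparse)

sparsity <- sparse / total_cells              # share of cells that are zero
round(100 * sparsity, 2)                      # about 99.96, printed by tm as 100%

dense_gib <- total_cells * 8 / 1024^3         # assumed 8 bytes per cell
dense_gib                                     # roughly 9 GiB for the dense matrix
```

In other words, fewer than 4 cells in 10000 are non-zero, and converting the matrix to dense form is memory-hungry even for this 30000-line sample.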
Let’s check which 20 terms are the most common in our cleaned sample corpus:
# Calculate the rowSums: term_frequency
term_frequency <- rowSums(sample_m)
# Sort term_frequency in descending order
term_frequency <- sort(term_frequency, decreasing = TRUE)
View the top 20 most common words:
term_frequency[1:20]
said will one like get just time year can make new day love work good
2935 2819 2658 2354 2310 2271 2189 2018 1976 1743 1633 1632 1464 1460 1382
say peopl now know want
1366 1350 1348 1337 1335
Plot a bar chart of the 20 most common words:
barplot(term_frequency[20:1], col = "steelblue", las = 2)
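Term frequencies can also be obtained without building a dense matrix at all. The sketch below is a base-R illustration on toy documents, not the code used for this report; the `docs` vector is made up for the example.

```r
# Toy documents standing in for the cleaned sample
docs <- c("love this day", "good day good work", "love good work")

# Split every document into words and count each word across all documents
tokens    <- unlist(strsplit(docs, "\\s+"))
term_freq <- sort(table(tokens), decreasing = TRUE)

head(term_freq, 3)

# With tm, the sparse matrix can likely be summed directly, for example with
# slam::row_sums(sample_tdm), which avoids the as.matrix() conversion.
```

Counting tokens directly scales with the number of words rather than with the number of term-document pairs, which is why it stays cheap on larger samples.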

Word cloud
A word cloud is a very popular way to visualize term frequencies (and is arguably overused). In a word cloud, word size is usually scaled to frequency, and in some cases color encodes another measurement.
Let’s look at the 50 most frequent single words (unigrams) in our clean corpus:
# Create word_freqs
word_freqs <- data.frame(term = names(term_frequency),
                         num = term_frequency)
# Print the wordcloud with the specified colors
wordcloud(word_freqs$term,
          word_freqs$num,
          max.words = 50,
          colors = c("grey60", "darkorange", "steelblue")
)

N-Gram tokenization
Now we shift our focus to tokens of two and three words (bigrams and trigrams). These can surface useful phrases that provide additional insight or better predictive features for a machine learning algorithm.
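The report loads RWeka, whose NGramTokenizer is a common choice for this step. The sketch below uses only base R instead, to show what n-gram tokenization does; `make_ngrams` is a hypothetical helper and the documents are toy examples.

```r
# Hypothetical helper: all n-grams (as space-separated strings) in one text.
make_ngrams <- function(text, n) {
  words <- strsplit(text, "\\s+")[[1]]
  words <- words[nzchar(words)]                  # drop empty strings from leading spaces
  if (length(words) < n) return(character(0))   # text too short for an n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

docs <- c("happy mothers day", "happy mothers day to you", "happy new year")

# Count bigrams and trigrams across all documents
bigram_counts  <- sort(table(unlist(lapply(docs, make_ngrams, n = 2))), decreasing = TRUE)
trigram_counts <- sort(table(unlist(lapply(docs, make_ngrams, n = 3))), decreasing = TRUE)

head(bigram_counts, 3)
head(trigram_counts, 3)
```

A text of k words yields k - n + 1 n-grams, so short texts such as tweets contribute few trigrams.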

Insights
- Stemming needs to be adjusted so that we don’t get trigrams like “happi mother day” or “presid barack obama”
- Lowercasing loses information about the presence of personal names, city names, state names and the like
- Increasing the N-gram order, i.e. going from bigrams to trigrams, leads to a drastic decrease in observed counts
- The corpora are huge, so more research is needed into how to handle them efficiently, especially when it comes to building a model trained on the complete data set.
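One possible way to handle files of this size is to read and process them in chunks instead of loading everything at once. The following is only a sketch of that idea, not something used in this report; `count_words_chunked` is a hypothetical helper and the chunk size is arbitrary.

```r
# Hypothetical helper: count lines and words in a file, one chunk at a time.
count_words_chunked <- function(path, chunk_size = 10000) {
  con <- file(path, "rb")   # binary mode; text mode can stop early at some control characters on Windows
  on.exit(close(con))
  total_lines <- 0
  total_words <- 0
  repeat {
    chunk <- readLines(con, n = chunk_size, encoding = "UTF-8",
                       skipNul = TRUE, warn = FALSE)
    if (length(chunk) == 0) break                # end of file reached
    total_lines <- total_lines + length(chunk)
    total_words <- total_words + sum(lengths(strsplit(trimws(chunk), "\\s+")))
  }
  c(lines = total_lines, words = total_words)
}

# Toy file standing in for a real corpus
tmp <- tempfile(fileext = ".txt")
writeLines(c("one two three", "four five", "six"), tmp)
chunked_result <- count_words_chunked(tmp, chunk_size = 2)
chunked_result
```

Only one chunk is held in memory at a time, so the same loop could accumulate n-gram counts instead of word counts.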
Next steps
Future work on the capstone project will focus on developing a proper modelling strategy: choosing and constructing an adequate set of features, and choosing and implementing a prediction algorithm that makes it possible to build a fast and user-friendly app.
