Summary

The report contains an exploratory analysis of text data for the Coursera Data Science Capstone Project. The source data includes texts from three sources: twitter, blogs and news. Four languages are available: English, Finnish, Russian and German. This analysis focuses on the English data sources. The goal of the project is to build an application based on a predictive text model that is capable of predicting the next word for a given phrase. The data for training the predictive model was provided by SwiftKey.

Load Data

The first step is to download the data file “Coursera-SwiftKey.zip” from the Internet.

# packages used in this report: tm (corpus tools), wordcloud (plots), vegan (kendall.global)
library(tm)
library(wordcloud)
library(vegan)

# data URL
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

# download the data file
if (!dir.exists('data')) {
    dir.create('data')
}
if (!file.exists('./data/Coursera-SwiftKey.zip')) { 
    download.file(url, 
                  destfile = "./data/Coursera-SwiftKey.zip") 
}

# unzip the data archive
if (!dir.exists('./data/final')) {
    unzip('./data/Coursera-SwiftKey.zip', exdir = './data')
}

Data sources are organised by language:

# list the data directories
dir('./data/final')
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"

Let’s read the twitter, blogs and news texts in English and count the lines.

# read the English texts
en.corp <- VCorpus(DirSource(directory = './data/final/en_US/', encoding = 'UTF-8'),
                   readerControl = list(language = 'en'))
Text source    Number of lines    File size, MB
Twitter                2360148            159.4
Blogs                   899288            200.4
News                   1010242            196.3
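
For reference, here is a minimal sketch of how the line counts and file sizes in the table could be computed directly from the raw files (readLines is assumed here purely for illustration; it is slow on files of this size but adequate for a one-off count):

# a sketch: count lines and sizes of the raw English files
en.files <- file.path('./data/final/en_US',
                      c('en_US.twitter.txt', 'en_US.blogs.txt', 'en_US.news.txt'))
data.frame(source = c('Twitter', 'Blogs', 'News'),
           lines = sapply(en.files, function(f) length(readLines(f, skipNul = TRUE)),
                          USE.NAMES = FALSE),
           size.mb = round(file.size(en.files) / 2^20, 1))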

The application we are going to create will not know which source the phrase to be completed comes from, so we will ignore information about text origin and combine the twitter, blogs and news texts.

The data files are rather large, so we will use 10% of the lines as a training sample. We also need to clean the data:

# sampling parameters: 10% of lines per source (per the text above); the seed value itself is arbitrary
train.part <- 0.1
my.seed <- 123

for (i in 1:3) {
    # select the training sample
    n <- length(en.corp[[i]]$content)
    set.seed(my.seed)
    inTrain <- sample(1:n, size = train.part * n)
    en.corp[[i]]$content <- en.corp[[i]]$content[inTrain]
    
    # remove URLs from the text
    en.corp[[i]]$content <- gsub(en.corp[[i]]$content, 
                                 pattern = '(f|ht)tp(s?)://(.*)[.][a-z]+',
                                 replacement = '')
}
# clean the text data
en.corp <- tm_map(en.corp, 
                  content_transformer(function(x) iconv(enc2utf8(x), 
                                                        sub = 'byte')))
en.corp <- tm_map(en.corp, removePunctuation)
en.corp <- tm_map(en.corp, removeNumbers)
en.corp <- tm_map(en.corp, content_transformer(tolower))

# extend the standard English stop list with common twitter tokens
my.stopwords.en <- c(stopwords('en'), 'rt', 'via')
en.corp <- tm_map(en.corp, removeWords, my.stopwords.en)

en.corp <- tm_map(en.corp, stripWhitespace)

save.image('./data/corp_en.RData')

For any further analysis we will need document-term matrices, which count the occurrences of words in the text corpora.

# remove documents that are empty or consist of a single space
for (i in 1:3) {
    en.corp[[i]]$content <- en.corp[[i]]$content[!(en.corp[[i]]$content %in% c(' ', ''))]
}

# document-term matrix for blogs
dtm.bl <- DocumentTermMatrix(VCorpus(VectorSource(en.corp[["en_US.blogs.txt"]]$content)))
dtm.bl <- removeSparseTerms(dtm.bl, 0.999)
freq.bl <- sort(colSums(as.matrix(dtm.bl)), decreasing = TRUE)

# document-term matrix for news
dtm.nw <- DocumentTermMatrix(VCorpus(VectorSource(en.corp[["en_US.news.txt"]]$content)))
dtm.nw <- removeSparseTerms(dtm.nw, 0.999)
freq.nw <- sort(colSums(as.matrix(dtm.nw)), decreasing = TRUE)

# document-term matrix for twitter
dtm.tw <- DocumentTermMatrix(VCorpus(VectorSource(en.corp[["en_US.twitter.txt"]]$content)))
dtm.tw <- removeSparseTerms(dtm.tw, 0.999)
freq.tw <- sort(colSums(as.matrix(dtm.tw)), decreasing = TRUE)

Data Visualisation

A word cloud plot shows words scaled in proportion to their frequency.

# blogs word cloud
wordcloud(names(freq.bl), freq.bl, max.words = 80, random.order = F)
mtext('Top-80 words from blogs', side = 3, line = 1, col = 'blue', font = 3, cex = .9)

head(freq.bl)
##   one  will  just   can  like  time 
## 12379 11186  9946  9925  9709  8725
# news word cloud
wordcloud(names(freq.nw), freq.nw, max.words = 80, random.order = F)
mtext('Top-80 words from news', side = 3, line = 1, col = 'blue', font = 3, cex = .9)

head(freq.nw)
##  said  will   one   new  also   two 
## 24972 10719  8114  7060  5926  5786
# twitter word cloud
wordcloud(names(freq.tw), freq.tw, max.words = 80, random.order = F)
mtext('Top-80 words from twitter', side = 3, line = 1, col = 'blue', font = 3, cex = .9)

head(freq.tw)
##  just  like   get  love  good  will 
## 15056 12089 11114 10444  9957  9518

As the plots above suggest, news texts have a distinctive style: a few style-specific words (such as “said”) are repeated far more often than the top words of the other corpora. Overall, the text corpora from news, blogs and twitter contain 3223, 3072 and 1037 unique words respectively. The vocabulary of the news texts is therefore the most diverse, while the vocabulary of twitter microblogging is the most limited, which is explained by the allowed length of the messages. For further analysis it is important to answer the question: is the share of words common to the three text sources large enough to treat these sources equally?

# words common to all three corpora and their frequencies
common.en <- intersect(intersect(names(freq.bl), names(freq.nw)), names(freq.tw))
common.en <- data.frame(word = common.en,
                        freq.bl = freq.bl[common.en],
                        freq.nw = freq.nw[common.en],
                        freq.tw = freq.tw[common.en])
common.en$rank.freq.bl <- rank(common.en$freq.bl)
common.en$rank.freq.nw <- rank(common.en$freq.nw)
common.en$rank.freq.tw <- rank(common.en$freq.tw)

# Kendall's coefficient of concordance for the ranks of common words
kndl <- kendall.global(common.en[, grep(colnames(common.en), pattern = 'rank.', value = T)])
kndl
## $Concordance_analysis
##                 Group.1
## W          7.317040e-01
## F          5.454454e+00
## Prob.F    1.242922e-199
## Chi2       1.927308e+03
## Prob.perm  1.000000e-03
## 
## attr(,"class")
## [1] "kendall.global"

879 words are found in all three text corpora: 27.3%, 84.8% and 28.6% of the vocabulary for news, twitter and blogs respectively. Kendall’s coefficient of concordance between the ranks of the words by their occurrence in the three types of texts is statistically significant and shows a positive association: W = 0.73. This means that all three types of texts can be combined into a single sample for training the model.
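
The shares above can be reproduced from the objects created earlier; a minimal sketch, assuming freq.bl, freq.nw, freq.tw and common.en are still in the workspace:

# share of the 879 common words in each corpus dictionary, %
round(100 * nrow(common.en) /
          c(news = length(freq.nw), twitter = length(freq.tw), blogs = length(freq.bl)), 1)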

What’s Next

The predictive text model will be based on the frequencies of occurrence of n-grams. According to Wikipedia, “an n-gram is a contiguous sequence of n items from a given sequence of text or speech”. In order to create the model we need to accomplish a number of steps using our corpus (a sketch of the first step follows the list):
* build n-gram frequency matrices;
* find associations between words and n-grams, using the frequency matrices.
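
As an illustration of the first step, here is a minimal sketch of a bigram frequency matrix for the blogs corpus. The RWeka tokenizer is an assumption (any tokenizer accepted by tm’s control list would work), and the sketch reuses en.corp and the 0.999 sparsity threshold from the chunks above.

# a sketch: bigram (n = 2) frequency matrix for the blogs corpus
library(RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm.bl.2 <- DocumentTermMatrix(VCorpus(VectorSource(en.corp[["en_US.blogs.txt"]]$content)),
                               control = list(tokenize = BigramTokenizer))
dtm.bl.2 <- removeSparseTerms(dtm.bl.2, 0.999)
freq.bl.2 <- sort(colSums(as.matrix(dtm.bl.2)), decreasing = TRUE)
head(freq.bl.2)
# associations can then be explored with tm's findAssocs(), e.g.
# findAssocs(dtm.bl.2, names(freq.bl.2)[1], 0.5)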